You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Boyang Chen <re...@gmail.com> on 2020/12/02 19:35:12 UTC

[DISCUSS] KIP-691: Transactional Producer Exception Handling

Hey there,

I would like to start a discussion thread for KIP-691:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling

The KIP is aiming to simplify the exception handling logic for
transactional Producer users by classifying fatal and non-fatal exceptions
and throw them correspondingly for easier catch and retry. Let me know what
you think.

Best,
Boyang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks Jason!

On Tue, Feb 2, 2021 at 11:10 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang, sounds good to me.
>
> -Jason
>
> On Mon, Feb 1, 2021 at 9:32 AM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Hey Jason and Chia-Ping,
> >
> > looking at the discussion thread for KIP-706, we are still in discussion
> > with the final API version, so I made the decision to decouple the two
> KIPs
> > so that this one could move forward. To be specific, we are still
> > introducing ProduceFailed error but as a `near-future` improvement, which
> > should be implemented right after KIP-706 is finalized.
> >
> > Let me know your thoughts.
> >
> > Best,
> > Boyang
> >
> > On Sat, Jan 30, 2021 at 11:57 AM Chia-Ping Tsai <ch...@apache.org>
> > wrote:
> >
> > > > a richer return type. Let me know if you are good with this, and
> > whether
> > > > Chia-Ping is also happy here :)
> > >
> > > sure. I'd like to join this bigger party :)
> > >
> > > On 2021/01/30 01:03:57, Boyang Chen <re...@gmail.com>
> wrote:
> > > > This is a great proposal Jason, I already integrated KIP-691 with
> > KIP-706
> > > > template to provide a new ProduceFailedException as well as a new
> > > > Producer#produce API with CompletionStage as old send API replacement
> > > with
> > > > a richer return type. Let me know if you are good with this, and
> > whether
> > > > Chia-Ping is also happy here :)
> > > >
> > > > Best,
> > > > Boyang
> > > >
> > > > On Fri, Jan 29, 2021 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > > >
> > > > > Hi Boyang,
> > > > >
> > > > > Ok, if we are going to ask all users to update their code anyway,
> > > maybe we
> > > > > can get more bang out of this.
> > > > >
> > > > > One of the remaining problems is that a user will only get the
> richer
> > > > > information if they are supplying a Callback. If they wait on the
> > > future,
> > > > > then they will get the usual exception. Our hands are somewhat tied
> > > here
> > > > > because of compatibility. One thing we have wanted to do for a long
> > > time is
> > > > > introduce a new send method which returns
> > > > > `CompletableFuture<RecordMetadata>` instead. If we did so, then we
> > > wouldn't
> > > > > need to change `Callback` at all. Instead we could make the
> contract
> > > of the
> > > > > new api that it will always return `SendFailedException`, which
> could
> > > > > provide a method to get the failure type.
> > > > >
> > > > > I think Chia-Ping has been thinking about this:
> > > > > https://issues.apache.org/jira/browse/KAFKA-12227. Is there a way
> to
> > > > > consolidate our efforts in order to avoid unnecessary noise for
> > users?
> > > It
> > > > > would be awesome to see a more modern API which also addresses the
> > > > > confusing error handling.
> > > > >
> > > > > -Jason
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Jan 29, 2021 at 3:48 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Jason, my hope is that we could streamline the callback
> > > function
> > > > > > signature, so that in the next major release (3.0),
> > > > > > onCompletion(RecordMetadata metadata, SendFailure sendFailure)
> will
> > > > > become
> > > > > > the only API to be implemented.
> > > > > >
> > > > > > On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <
> > jason@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Boyang,
> > > > > > >
> > > > > > > Do you think it's necessary to deprecate the old `onCompletion`
> > > > > > callback? I
> > > > > > > was thinking there's probably no harm leaving it around. Users
> > > might
> > > > > not
> > > > > > > care about the failure type. Other than that, it looks good to
> > me.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Guozhang,
> > > > > > > >
> > > > > > > > the CommitFailedException wrapping would cover all the
> > non-fatal
> > > > > > > exceptions
> > > > > > > > as we listed out in the KIP, generally speaking any exception
> > > that
> > > > > > could
> > > > > > > > recover safely by calling abortTxn should be wrapped.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Boyang,
> > > > > > > > >
> > > > > > > > > I think maybe there's something cracking here :) I'm just
> > > asking
> > > > > for
> > > > > > > > > clarifications that as of today, which non-fatal exceptions
> > the
> > > > > newly
> > > > > > > > > introduced CommitFailedException would cover, and it seems
> to
> > > be
> > > > > only
> > > > > > > 1)
> > > > > > > > > unknown pid, 2) invalid pid mapping, and 3) concurrent
> > > > > transactions.
> > > > > > Is
> > > > > > > > > that correct?
> > > > > > > > >
> > > > > > > > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Guozhang, I think TimeoutException would not be
> covered
> > > here
> > > > > as
> > > > > > > it
> > > > > > > > > > potentially has a risk of hitting an illegal state on the
> > > broker
> > > > > > side
> > > > > > > > > when
> > > > > > > > > > the previous commit was actually successful. Users should
> > > try to
> > > > > > > > increase
> > > > > > > > > > their max.block.ms to avoid hitting the timeout as a
> base
> > > > > > > suggestion,
> > > > > > > > > > which
> > > > > > > > > > is discussed in the KIP.
> > > > > > > > > >
> > > > > > > > > > As for the umbrella exception, I agree it has values to
> > some
> > > > > > extent,
> > > > > > > > but
> > > > > > > > > in
> > > > > > > > > > terms of adoption difficulty, additional exception types
> > > usually
> > > > > > make
> > > > > > > > the
> > > > > > > > > > error handling more complex than simplified, and we are
> > > doing our
> > > > > > > best
> > > > > > > > to
> > > > > > > > > > avoid compatibility issues caused by wrapping previously
> > > thrown
> > > > > raw
> > > > > > > > > > exceptions that invalidates user error handling. And I'm
> > > also not
> > > > > > in
> > > > > > > > > favor
> > > > > > > > > > of our exception hierarchy today where all exceptions are
> > > > > > subclasses
> > > > > > > of
> > > > > > > > > > KafkaException, and we already mixed its handling with
> > other
> > > > > > specific
> > > > > > > > > > exception types that could be caught as superclass
> > > > > KafkaException.
> > > > > > > That
> > > > > > > > > > could be addressed in a separate proposal if we see
> > > necessary.
> > > > > > > > > >
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <
> > > > > wangguoz@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Jason for the suggestion, that looks good to me
> > too.
> > > > > > > > > > >
> > > > > > > > > > > Regarding the non-fatal exceptions wrapped as
> > > CommitFailed, I
> > > > > > would
> > > > > > > > > like
> > > > > > > > > > to
> > > > > > > > > > > clarify if we would cover all the following cases: 1)
> > > timeout,
> > > > > 2)
> > > > > > > > > unknown
> > > > > > > > > > > pid, 3) invalid pid mapping, 4) concurrent
> transactions?
> > > > > > > > > > >
> > > > > > > > > > > BTW I think it still makes sense to use an umbrella
> > > exception
> > > > > in
> > > > > > > case
> > > > > > > > > in
> > > > > > > > > > > the future we add more non-fatal cases even though
> today
> > we
> > > > > only
> > > > > > > > have a
> > > > > > > > > > > few.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Guozhang
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks Jason, I agree with the proposed solution
> here,
> > > will
> > > > > > > update
> > > > > > > > > the
> > > > > > > > > > > KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > > > > > > > jason@confluent.io
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Boyang,
> > > > > > > > > > > > >
> > > > > > > > > > > > > It seems like a reasonable suggestion. I wonder if
> a
> > > flag
> > > > > is
> > > > > > > > > > sufficient
> > > > > > > > > > > > > though. The current `Callback` documentation treats
> > > "fatal"
> > > > > > > > errors
> > > > > > > > > > from
> > > > > > > > > > > > the
> > > > > > > > > > > > > perspective of the individual message that was
> sent.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ```
> > > > > > > > > > > > >     *                  Non-Retriable exceptions
> > > (fatal, the
> > > > > > > > message
> > > > > > > > > > > will
> > > > > > > > > > > > > never be sent):
> > > > > > > > > > > > > ```
> > > > > > > > > > > > >
> > > > > > > > > > > > > However, we also have fatal errors from the
> > > perspective of
> > > > > > the
> > > > > > > > > > > > transaction
> > > > > > > > > > > > > (e.g. when the producer gets fenced). Perhaps that
> > > suggests
> > > > > > we
> > > > > > > > need
> > > > > > > > > > > > > something richer than a boolean flag:
> > > > > > > > > > > > >
> > > > > > > > > > > > > At a high level, I think the following cases are
> > > possible:
> > > > > > > > > > > > >
> > > > > > > > > > > > > - message rejected (e.g. message too large, invalid
> > > topic)
> > > > > > > > > > > > > - delivery failed after retries/delivery timeout
> > (e.g.
> > > > > > timeout,
> > > > > > > > crc
> > > > > > > > > > > > error,
> > > > > > > > > > > > > not enough replicas)
> > > > > > > > > > > > > - transaction failed (e.g. producer fenced, invalid
> > > > > > transaction
> > > > > > > > > > state)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Perhaps instead we can have a type like the
> > following:
> > > > > > > > > > > > >
> > > > > > > > > > > > > class SendFailure {
> > > > > > > > > > > > >   FailureType failureType;
> > > > > > > > > > > > >   Exception cause;
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > > > enum FailureType {
> > > > > > > > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED,
> > > TRANSACTION_FAILED
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > > > (Not married to any of these names, just a starting
> > > point.)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then we add a new `onCompletion` as you've
> suggested:
> > > > > > > > > > > > >
> > > > > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > > > > SendFailure
> > > > > > > > > > > failure) {
> > > > > > > > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > > > This would give streams and other applications
> enough
> > > > > > > information
> > > > > > > > > to
> > > > > > > > > > > know
> > > > > > > > > > > > > whether the message can be retried and whether the
> > > > > > transaction
> > > > > > > > can
> > > > > > > > > be
> > > > > > > > > > > > > aborted.
> > > > > > > > > > > > >
> > > > > > > > > > > > > What do you think?
> > > > > > > > > > > > >
> > > > > > > > > > > > > -Jason
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks Jason for the thoughts.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson
> <
> > > > > > > > > > jason@confluent.io
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Boyang,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the iterations here. I think this is
> > > > > something
> > > > > > > we
> > > > > > > > > > should
> > > > > > > > > > > > > have
> > > > > > > > > > > > > > > done a long time ago. It sounds like there are
> > two
> > > API
> > > > > > > > changes
> > > > > > > > > > > here:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. We are introducing the
> `CommitFailedException`
> > > to
> > > > > wrap
> > > > > > > > > > abortable
> > > > > > > > > > > > > > > errors that are raised from
> `commitTransaction`.
> > > This
> > > > > > > sounds
> > > > > > > > > fine
> > > > > > > > > > > to
> > > > > > > > > > > > > me.
> > > > > > > > > > > > > > As
> > > > > > > > > > > > > > > far as I know, the only case we might need this
> > is
> > > when
> > > > > > we
> > > > > > > > add
> > > > > > > > > > > > support
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > let producers recover from coordinator
> timeouts.
> > > Are
> > > > > > there
> > > > > > > > any
> > > > > > > > > > > > others?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think the purpose here is to ensure non-fatal
> > > > > > exceptions
> > > > > > > > are
> > > > > > > > > > > > unified
> > > > > > > > > > > > > > under the same
> > > > > > > > > > > > > > exception umbrella, to make the proceeding to
> abort
> > > any
> > > > > > > ongoing
> > > > > > > > > > > > > transaction
> > > > > > > > > > > > > > much easier.
> > > > > > > > > > > > > > I don't think `coordinator timeouts` is the only
> > > case to
> > > > > > > > recover
> > > > > > > > > > > here,
> > > > > > > > > > > > > > since we have other
> > > > > > > > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2. We are wrapping non-fatal errors raised from
> > > `send` as
> > > > > > > > > > > > > `KafkaException`.
> > > > > > > > > > > > > > > The motivation for this is less clear to me and
> > it
> > > > > > doesn't
> > > > > > > > look
> > > > > > > > > > > like
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > example from the KIP depends on it. My concern
> > > here is
> > > > > > > > > > > compatibility.
> > > > > > > > > > > > > > > Currently we have the following documentation
> for
> > > the
> > > > > > > > > `Callback`
> > > > > > > > > > > API:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > >      *                  Non-Retriable
> exceptions
> > > > > (fatal,
> > > > > > > the
> > > > > > > > > > > message
> > > > > > > > > > > > > will
> > > > > > > > > > > > > > > never be sent):
> > > > > > > > > > > > > > >      *
> > > > > > > > > > > > > > >      *                  InvalidTopicException
> > > > > > > > > > > > > > >      *
> > > OffsetMetadataTooLargeException
> > > > > > > > > > > > > > >      *
> > > RecordBatchTooLargeException
> > > > > > > > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > > > > > > > >      *                  UnknownServerException
> > > > > > > > > > > > > > >      *
> > UnknownProducerIdException
> > > > > > > > > > > > > > >      *
> > > InvalidProducerEpochException
> > > > > > > > > > > > > > >      *
> > > > > > > > > > > > > > >      *                  Retriable exceptions
> > > > > (transient,
> > > > > > > may
> > > > > > > > be
> > > > > > > > > > > > covered
> > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > increasing #.retries):
> > > > > > > > > > > > > > >      *
> > > > > > > > > > > > > > >      *                  CorruptRecordException
> > > > > > > > > > > > > > >      *
> InvalidMetadataException
> > > > > > > > > > > > > > >      *
> > > > > > > NotEnoughReplicasAfterAppendException
> > > > > > > > > > > > > > >      *
> > NotEnoughReplicasException
> > > > > > > > > > > > > > >      *
> OffsetOutOfRangeException
> > > > > > > > > > > > > > >      *                  TimeoutException
> > > > > > > > > > > > > > >      *
> > > > > UnknownTopicOrPartitionException
> > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > If we wrap all the retriable exceptions
> > documented
> > > here
> > > > > > as
> > > > > > > > > > > > > > > `KafkaException`, wouldn't that break any error
> > > > > handling
> > > > > > > that
> > > > > > > > > > users
> > > > > > > > > > > > > might
> > > > > > > > > > > > > > > already have? it's gonna introduce a
> > compatibility
> > > > > issue.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The original intention was to simplify `send`
> > > callback
> > > > > > > error
> > > > > > > > > > > handling
> > > > > > > > > > > > > by
> > > > > > > > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > > > > > > > we have to prepare an exhausting list of
> exceptions
> > > to
> > > > > > catch
> > > > > > > as
> > > > > > > > > > > fatal,
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > the same lengthy list to catch as
> > > > > > > > > > > > > > non-fatal. It would be much easier if we got
> > `hints`
> > > from
> > > > > > the
> > > > > > > > > > > callback.
> > > > > > > > > > > > > > However,
> > > > > > > > > > > > > > I agree there is a compatibility concern, what
> > about
> > > > > > > > deprecating
> > > > > > > > > > the
> > > > > > > > > > > > > > existing:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > void onCompletion(RecordMetadata metadata,
> > Exception
> > > > > > > exception)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > and replace it with:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > default void onCompletion(RecordMetadata
> metadata,
> > > > > > Exception
> > > > > > > > > > > exception,
> > > > > > > > > > > > > > boolean isFatal) {
> > > > > > > > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > to make sure new users get the benefit of
> > > understanding
> > > > > the
> > > > > > > > > > fatality
> > > > > > > > > > > > > based
> > > > > > > > > > > > > > on the info presented by the producer?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > > > > > > > info@hiringuru.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Why  we are receiving all emails kindly
> remove
> > us
> > > > > from
> > > > > > > > > > > > > > > > dev@kafka.apache.org we don't want to
> receive
> > > emails
> > > > > > > > > anymore.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks Boyang, yes I think I was confused
> > > about the
> > > > > > > > > different
> > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > two abortTxn calls, and now I get it was
> not
> > > > > > > > intentional. I
> > > > > > > > > > > > think I
> > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > have more concerns.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang
> Chen <
> > > > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Thanks for the clarification Guozhang, I
> > got
> > > your
> > > > > > > point
> > > > > > > > > > that
> > > > > > > > > > > we
> > > > > > > > > > > > > > want
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > have a consistent handling of fatal
> > > exceptions
> > > > > > being
> > > > > > > > > thrown
> > > > > > > > > > > > from
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > abortTxn. I modified the current template
> > to
> > > move
> > > > > > the
> > > > > > > > > fatal
> > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > > > try-catch outside of the processing loop
> to
> > > make
> > > > > > sure
> > > > > > > > we
> > > > > > > > > > > could
> > > > > > > > > > > > > get
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > chance
> > > > > > > > > > > > > > > > > > to close consumer/producer modules. Let
> me
> > > know
> > > > > > what
> > > > > > > > you
> > > > > > > > > > > think.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang
> > Chen
> > > <
> > > > > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > My understanding is that
> abortTransaction
> > > would
> > > > > > > only
> > > > > > > > > > throw
> > > > > > > > > > > > when
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > producer is in fatal state. For
> > > CommitFailed,
> > > > > the
> > > > > > > > > > producer
> > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > still
> > > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > > in the abortable error state, so that
> > > > > > > > abortTransaction
> > > > > > > > > > call
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > throw.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM
> Guozhang
> > > Wang
> > > > > <
> > > > > > > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >> Or are you going to maintain some
> > internal
> > > > > state
> > > > > > > > such
> > > > > > > > > > > that,
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> `abortTransaction` in the catch block
> > > would
> > > > > > never
> > > > > > > > > throw
> > > > > > > > > > > > again?
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM
> > Guozhang
> > > > > Wang <
> > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > I've also thought about this (i.e.
> > using
> > > > > > > > > CommitFailed
> > > > > > > > > > > for
> > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > > > > > > > >> > but what I'm pondering is that, in
> the
> > > catch
> > > > > > > > > > > > (CommitFailed)
> > > > > > > > > > > > > > > block,
> > > > > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > > >> > would happen if the
> > > > > > > `producer.abortTransaction();`
> > > > > > > > > > > throws
> > > > > > > > > > > > > > again?
> > > > > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > > > >> > that be captured as a fatal and
> cause
> > > the
> > > > > > client
> > > > > > > > to
> > > > > > > > > > > close
> > > > > > > > > > > > > > again.
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > If yes, then naively the pattern
> would
> > > be:
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > > > > > > > >> >         // Transaction commit failed
> > > with
> > > > > > > > abortable
> > > > > > > > > > > error,
> > > > > > > > > > > > > > user
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > >> reset
> > > > > > > > > > > > > > > > > > >> >         // the application state and
> > > resume
> > > > > > > with a
> > > > > > > > > new
> > > > > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > >> > root
> > > > > > > > > > > > > > > > > > >> >         // cause was wrapped in the
> > > thrown
> > > > > > > > > exception.
> > > > > > > > > > > > > > > > > > >> >
> > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > > > > > > > >> >
> >  producer.abortTransaction();
> > > > > > > > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > > > > > > > >> >         }
> > > > > > > > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > > > > > > > >> >     }
> > > > > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM
> > Boyang
> > > > > Chen <
> > > > > > > > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> Jason and I were discussing the new
> > API
> > > > > > offline
> > > > > > > > and
> > > > > > > > > > > > decided
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > take
> > > > > > > > > > > > > > > > > > >> >> another
> > > > > > > > > > > > > > > > > > >> >> approach. Firstly, the reason not
> to
> > > > > invent a
> > > > > > > new
> > > > > > > > > API
> > > > > > > > > > > > with
> > > > > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > > > > > > > >> >> flag is for compatibility
> > > consideration,
> > > > > > since
> > > > > > > > old
> > > > > > > > > > EOS
> > > > > > > > > > > > code
> > > > > > > > > > > > > > > > would not
> > > > > > > > > > > > > > > > > > >> know
> > > > > > > > > > > > > > > > > > >> >> that a given transaction commit was
> > > failed
> > > > > > > > > internally
> > > > > > > > > > > as
> > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > don't
> > > > > > > > > > > > > > > > > > >> listen
> > > > > > > > > > > > > > > > > > >> >> to the output flag. Our proposed
> > > > > alternative
> > > > > > > > > solution
> > > > > > > > > > > is
> > > > > > > > > > > > to
> > > > > > > > > > > > > > let
> > > > > > > > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > > > > > > > >> >> throw CommitFailedException
> whenever
> > > the
> > > > > > commit
> > > > > > > > > > failed
> > > > > > > > > > > > with
> > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> exception*, so that on the caller
> > side
> > > the
> > > > > > > > handling
> > > > > > > > > > > logic
> > > > > > > > > > > > > > > > becomes:
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > > > > > > > >> >>
> > >  producer.commitTransaction();
> > > > > > > > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > > > >> >>
> > >  producer.abortTransaction();
> > > > > > > > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > > > > > > > >> >>     } catch (CommitFailedException
> > e) {
> > > > > > > > > > > > > > > > > > >> >>         // Transaction commit
> failed
> > > with
> > > > > > > > abortable
> > > > > > > > > > > > error,
> > > > > > > > > > > > > > user
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > > > > > > > >> >>         // the application state
> and
> > > resume
> > > > > > > with
> > > > > > > > a
> > > > > > > > > > new
> > > > > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > >> >> root
> > > > > > > > > > > > > > > > > > >> >>         // cause was wrapped in the
> > > thrown
> > > > > > > > > exception.
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > > > >> >>
>  producer.abortTransaction();
> > > > > > > > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> This approach looks cleaner as all
> > > > > exception
> > > > > > > > types
> > > > > > > > > > > other
> > > > > > > > > > > > > than
> > > > > > > > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > > > > > > > >> >> will doom to be fatal, which is
> very
> > > easy
> > > > > to
> > > > > > > > adopt
> > > > > > > > > > for
> > > > > > > > > > > > > users.
> > > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> >> meantime, we still maintain the
> > > commitTxn
> > > > > > > > behavior
> > > > > > > > > to
> > > > > > > > > > > > throw
> > > > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> In addition, we decided to drop the
> > > > > > > > recommendation
> > > > > > > > > to
> > > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > > > >> >> TimeoutException and leave it to
> the
> > > users
> > > > > to
> > > > > > > > make
> > > > > > > > > > the
> > > > > > > > > > > > > call.
> > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > > >> downside
> > > > > > > > > > > > > > > > > > >> >> for blindly calling abortTxn upon
> > > timeout
> > > > > is
> > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > > could
> > > > > > > > > > > > > > > > result in
> > > > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > >> >> illegal state when the commit was
> > > already
> > > > > > > > > successful
> > > > > > > > > > on
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > > >> >> side. Without a good guarantee on
> the
> > > > > > outcome,
> > > > > > > > > > > > > > overcomplicating
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> >> template should not be encouraged
> > IMHO.
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> Let me know your thoughts on the
> new
> > > > > approach
> > > > > > > > here,
> > > > > > > > > > > thank
> > > > > > > > > > > > > > you!
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM
> > > Guozhang
> > > > > > Wang
> > > > > > > <
> > > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > Thanks for your clarification on
> > > 2)/3),
> > > > > > that
> > > > > > > > > makes
> > > > > > > > > > > > sense.
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM
> > > Boyang
> > > > > > Chen
> > > > > > > <
> > > > > > > > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >> > > Thanks for the input Guozhang,
> > > replied
> > > > > > > > inline.
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM
> > > > > Guozhang
> > > > > > > > Wang <
> > > > > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I
> > > read it
> > > > > > > again
> > > > > > > > > and
> > > > > > > > > > > > have
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> following
> > > > > > > > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that
> if
> > > > > > commitTxn
> > > > > > > > does
> > > > > > > > > > not
> > > > > > > > > > > > > throw
> > > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > > > > > > > >> >> > > > exception, and instead we
> rely
> > > on the
> > > > > > > > > > subsequent
> > > > > > > > > > > > > > beginTxn
> > > > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > > > > > > > >> >> > > > may violate some callers
> with a
> > > > > pattern
> > > > > > > > that
> > > > > > > > > > > > relying
> > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > > > > > > > >> >> > > > succeed to make some
> > non-rollback
> > > > > > > > operations.
> > > > > > > > > > For
> > > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > > > > > > > >> >> > > > // do some read-write on my
> > > local DB
> > > > > > > > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > > > > > > > >> >> > > > // if commitTxn succeeds,
> then
> > > commit
> > > > > > the
> > > > > > > > DB
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > The issue is that, committing
> > DB
> > > is a
> > > > > > > > > > > non-rollback
> > > > > > > > > > > > > > > > operation,
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > >> >> users
> > > > > > > > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > > > > > > > >> >> > > > just rely on commitTxn to
> > return
> > > > > > without
> > > > > > > > > error
> > > > > > > > > > to
> > > > > > > > > > > > > make
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > > > > > > > >> >> > > > call. Of course we can just
> > claim
> > > > > this
> > > > > > > > > pattern
> > > > > > > > > > is
> > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > legitimate
> > > > > > > > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > > > > > > > >> >> > > > the right way of doing
> things,
> > > but
> > > > > many
> > > > > > > > users
> > > > > > > > > > may
> > > > > > > > > > > > > > > naturally
> > > > > > > > > > > > > > > > > > adopt
> > > > > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > So maybe we should still let
> > > > > commitTxn
> > > > > > > also
> > > > > > > > > > throw
> > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > > > > > > > >> >> > > > which case we would then call
> > > > > abortTxn
> > > > > > > > again.
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > But if we do this, the
> pattern
> > > > > becomes:
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > > > > > > > >> >> > > >     } catch (...) {        //
> > > > > enumerate
> > > > > > > all
> > > > > > > > > > fatal
> > > > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException)
> {
> > > > > > > > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > if (!shouldCommit &&
> > running()) {
> > > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException)
> {
> > > > > > > > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > Which is much more
> complicated.
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > Thank makes me think, the
> > > alternative
> > > > > > we
> > > > > > > > have
> > > > > > > > > > > > > discussed
> > > > > > > > > > > > > > > > offline
> > > > > > > > > > > > > > > > > > >> may
> > > > > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > > > > >> >> > > > better: let commitTxn() to
> > > return a
> > > > > > > boolean
> > > > > > > > > > flag.
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > * If it returns true, it
> means
> > > the
> > > > > > commit
> > > > > > > > > > > > succeeded.
> > > > > > > > > > > > > > > Users
> > > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > > > > > > > >> >> > > > continue and do any external
> > > > > > non-rollback
> > > > > > > > > > > > operations
> > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > > > >> like.
> > > > > > > > > > > > > > > > > > >> >> > > > * If it returns false, it
> means
> > > the
> > > > > > > commit
> > > > > > > > > > failed
> > > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> error
> > > > > > > > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > > > > > > > >> >> > > > the txn has been aborted*.
> > Users
> > > do
> > > > > not
> > > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > > call
> > > > > > > > > > > > > > > > abortTxn
> > > > > > > > > > > > > > > > > > >> again.
> > > > > > > > > > > > > > > > > > >> >> > > > * If it throws, then it means
> > > fatal
> > > > > > > errors.
> > > > > > > > > > Users
> > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > shut
> > > > > > > > > > > > > > > > > > >> down
> > > > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > In this case, the pattern
> > > becomes:
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > > > > > > > >> >> > > >         commitSucceeded =
> > > commitTxn()
> > > > > > > > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > > > > > > > >> >> > > >         // reset offsets,
> > > rollback
> > > > > > > > > operations,
> > > > > > > > > > > etc
> > > > > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > > > > > > > >> >> > > >    // do other
> non-rollbackable
> > > > > things
> > > > > > > > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > > > > > > > >> >> > > >    // reset offsets, rollback
> > > > > > operations,
> > > > > > > > etc
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > Of course, if we want to have
> > > better
> > > > > > > > > visibility
> > > > > > > > > > > > into
> > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > > caused
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > > > > > > > >> >> > > > to fail and txn to abort. We
> > can
> > > let
> > > > > > the
> > > > > > > > > return
> > > > > > > > > > > > type
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > enum
> > > > > > > > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > > > > >> >> > > > a flag. But my main idea is
> to
> > > still
> > > > > > let
> > > > > > > > the
> > > > > > > > > > > > > commitTxn
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> final
> > > > > > > > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > > > > > > > >> >> > > > users can tell whether this
> txn
> > > > > > succeeded
> > > > > > > > or
> > > > > > > > > > not,
> > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > I agree that adding a boolean
> > > flag is
> > > > > > > > indeed
> > > > > > > > > > > useful
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > case.
> > > > > > > > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the
> > > behavior
> > > > > to
> > > > > > > > throw
> > > > > > > > > > > fatal
> > > > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > > > > >> >> > > > sure what do you mean by
> > "throw"
> > > > > here.
> > > > > > > Are
> > > > > > > > > you
> > > > > > > > > > > > > > proposing
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > > > > > > > >> >> > > > *pass* (not throw) in any
> fatal
> > > > > > > exceptions
> > > > > > > > > when
> > > > > > > > > > > > > > triggered
> > > > > > > > > > > > > > > > > > without
> > > > > > > > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*,
> the
> > > > > benefit
> > > > > > is
> > > > > > > > to
> > > > > > > > > > make
> > > > > > > > > > > > the
> > > > > > > > > > > > > > end
> > > > > > > > > > > > > > > > > > user's
> > > > > > > > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand
> > the
> > > > > claim
> > > > > > > > > > regarding
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > callback
> > > > > > > > > > > > > > > > > > >> that
> > > > > > > > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > > > > > > > >> >> > > > setup, it is not required to
> > > handle
> > > > > the
> > > > > > > > > > > exception".
> > > > > > > > > > > > > Are
> > > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not
> try
> > to
> > > > > > handle
> > > > > > > > any
> > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > > > > > > > >> >> > > > in the callback anymore, but
> > > just let
> > > > > > > > > > commitTxn()
> > > > > > > > > > > > > > itself
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > > > > >> >> > > > notified about the problem?
> > > > > Personally
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > in
> > > > > > > > > > > > > > Streams
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > >> should
> > > > > > > > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > > > > > > > >> >> > > > make the handling logic of
> the
> > > > > callback
> > > > > > > to
> > > > > > > > be
> > > > > > > > > > > > > > consistent
> > > > > > > > > > > > > > > > > > >> regardless
> > > > > > > > > > > > > > > > > > >> >> of
> > > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > > >> >> > > > EOS settings (today we have
> > > different
> > > > > > > logic
> > > > > > > > > > > > depending
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > >> logic,
> > > > > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > > > > >> >> > > > I think could be unified as
> > > well).
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > My idea originates from the
> > > claim on
> > > > > > send
> > > > > > > > > API:
> > > > > > > > > > > > > > > > > > >> >> > > "When used as part of a
> > > transaction, it
> > > > > > is
> > > > > > > > not
> > > > > > > > > > > > > necessary
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > define
> > > > > > > > > > > > > > > > > > >> a
> > > > > > > > > > > > > > > > > > >> >> > > callback or check the result of
> > the
> > > > > > future
> > > > > > > > in
> > > > > > > > > > > order
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > detect
> > > > > > > > > > > > > > > > > > >> errors
> > > > > > > > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > > > > > > > >> >> > > My understanding is that for
> EOS,
> > > the
> > > > > > > > exception
> > > > > > > > > > > will
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > detected
> > > > > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > > > > > > > >> >> > > transactional APIs either way,
> so
> > > it
> > > > > is a
> > > > > > > > > > duplicate
> > > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > >> track
> > > > > > > > > > > > > > > > > > >> >> > > all the sendExceptions in
> > > > > > RecordCollector.
> > > > > > > > > > > However, I
> > > > > > > > > > > > > > > looked
> > > > > > > > > > > > > > > > up
> > > > > > > > > > > > > > > > > > >> >> > > sendException is being used
> today
> > > as
> > > > > > > follow:
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > 1. Pass to
> > > "ProductionExceptionHandler"
> > > > > > for
> > > > > > > > any
> > > > > > > > > > > > default
> > > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > > >> customized
> > > > > > > > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > > > > > > > >> >> > > 2. Stop collecting offset info
> or
> > > new
> > > > > > > > > exceptions
> > > > > > > > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions
> > in
> > > > > > close(),
> > > > > > > > > > flush()
> > > > > > > > > > > > or
> > > > > > > > > > > > > > new
> > > > > > > > > > > > > > > > send()
> > > > > > > > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > To my understanding, we should
> > > still
> > > > > > honor
> > > > > > > > the
> > > > > > > > > > > > > commitment
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > #1
> > > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > >> >> any
> > > > > > > > > > > > > > > > > > >> >> > > user customized implementation.
> > > The #2
> > > > > > does
> > > > > > > > not
> > > > > > > > > > > > really
> > > > > > > > > > > > > > > > affect our
> > > > > > > > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still
> > > valuable
> > > > > > as
> > > > > > > it
> > > > > > > > > > could
> > > > > > > > > > > > > help
> > > > > > > > > > > > > > us
> > > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > > > > >> fast
> > > > > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > > > > >> >> > > new send() instead of waiting
> to
> > > later
> > > > > > > stage
> > > > > > > > of
> > > > > > > > > > > > > > processing.
> > > > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > > > > > > > >> >> > > I agree we should continue to
> > > record
> > > > > send
> > > > > > > > > > > exceptions
> > > > > > > > > > > > > even
> > > > > > > > > > > > > > > > under
> > > > > > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > > > >> >> case
> > > > > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > > > > >> >> > > ensure the strength of stream
> > side
> > > > > > Producer
> > > > > > > > > > logic.
> > > > > > > > > > > On
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > safer
> > > > > > > > > > > > > > > > > > >> side,
> > > > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > > > > > > > >> >> > > longer need to wrap certain
> fatal
> > > > > > > exceptions
> > > > > > > > > like
> > > > > > > > > > > > > > > > ProducerFenced
> > > > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > > >> >> > > TaskMigrated, since they should
> > not
> > > > > crash
> > > > > > > the
> > > > > > > > > > > stream
> > > > > > > > > > > > > > thread
> > > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > > >> thrown
> > > > > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > > > > >> >> > > raw format, once we adopt the
> new
> > > > > > > processing
> > > > > > > > > > model
> > > > > > > > > > > in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > send
> > > > > > > > > > > > > > > > > > >> phase.
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42
> PM
> > > > > Boyang
> > > > > > > > Chen <
> > > > > > > > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > Thanks for everyone's
> > feedback
> > > so
> > > > > > far.
> > > > > > > I
> > > > > > > > > have
> > > > > > > > > > > > > > polished
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > KIP
> > > > > > > > > > > > > > > > > > >> >> after
> > > > > > > > > > > > > > > > > > >> >> > > > > offline discussion with
> some
> > > folks
> > > > > > > > working
> > > > > > > > > on
> > > > > > > > > > > EOS
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > make the
> > > > > > > > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > > > > > > > >> >> > > > > handling more lightweight.
> > The
> > > > > > > essential
> > > > > > > > > > change
> > > > > > > > > > > > is
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > >> not
> > > > > > > > > > > > > > > > > > >> >> > > > > inventing a new
> intermediate
> > > > > > exception
> > > > > > > > > type,
> > > > > > > > > > > but
> > > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > > > > > > > >> >> > > > > transaction into two
> phases:
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > 1. The data transmission
> > phase
> > > > > > > > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > This way for any exception
> > > thrown
> > > > > > from
> > > > > > > > > phase
> > > > > > > > > > 1,
> > > > > > > > > > > > > will
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > > > > > > > >> >> > > > > phase 2 whether we should
> do
> > > commit
> > > > > > or
> > > > > > > > > abort,
> > > > > > > > > > > and
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > now on
> > > > > > > > > > > > > > > > > > >> >> > > > > `commitTransaction` should
> > only
> > > > > throw
> > > > > > > > fatal
> > > > > > > > > > > > > > exceptions,
> > > > > > > > > > > > > > > > > > >> similar to
> > > > > > > > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that
> > any
> > > > > > > > > > KafkaException
> > > > > > > > > > > > > caught
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > > > > > > > >> >> > > > > will be definitely fatal to
> > > crash
> > > > > the
> > > > > > > > app.
> > > > > > > > > > For
> > > > > > > > > > > > more
> > > > > > > > > > > > > > > > advanced
> > > > > > > > > > > > > > > > > > >> users
> > > > > > > > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > > > >> >> > > > > Streams, we have the
> ability
> > to
> > > > > > further
> > > > > > > > > wrap
> > > > > > > > > > > > > selected
> > > > > > > > > > > > > > > > types
> > > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > > > > >> >> > > > > exceptions to trigger task
> > > > > migration
> > > > > > if
> > > > > > > > > > > > necessary.
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > More details in the KIP,
> feel
> > > free
> > > > > to
> > > > > > > > take
> > > > > > > > > > > > another
> > > > > > > > > > > > > > > look,
> > > > > > > > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at
> 8:36
> > PM
> > > > > > Boyang
> > > > > > > > > Chen <
> > > > > > > > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the
> > > feedback.
> > > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at
> 5:26
> > > AM
> > > > > > Bruno
> > > > > > > > > > Cadonna
> > > > > > > > > > > <
> > > > > > > > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the
> KIP!
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also
> > not
> > > > > know
> > > > > > > the
> > > > > > > > > > > producer
> > > > > > > > > > > > > > > > internal
> > > > > > > > > > > > > > > > > > well
> > > > > > > > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > > > > > > > >> >> > > > > >> comment on the
> > > categorization.
> > > > > > > > However,
> > > > > > > > > I
> > > > > > > > > > > > think
> > > > > > > > > > > > > > > > having a
> > > > > > > > > > > > > > > > > > >> super
> > > > > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > > > > >> >> > > > > >> (e.g.
> RetriableException)
> > > that
> > > > > > > > encodes
> > > > > > > > > if
> > > > > > > > > > > an
> > > > > > > > > > > > > > > > exception is
> > > > > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > > > > > > > >> >> > > > > >> is cleaner, better
> > > > > understandable
> > > > > > > and
> > > > > > > > > less
> > > > > > > > > > > > > > > > error-prone,
> > > > > > > > > > > > > > > > > > >> because
> > > > > > > > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > > > > > > > >> >> > > > > >> when you add a new
> > non-fatal
> > > > > > > exception
> > > > > > > > > in
> > > > > > > > > > > > future
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > > > >> need
> > > > > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > > > > >> >> > > > > >> about letting it inherit
> > > from
> > > > > the
> > > > > > > > super
> > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > > > > >> >> > > > > >> the code will just
> behave
> > > > > > correctly
> > > > > > > > > > without
> > > > > > > > > > > > the
> > > > > > > > > > > > > > need
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > wrap
> > > > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > > > > > > > >> >> > > > > >> exception into another
> > > exception
> > > > > > > each
> > > > > > > > > time
> > > > > > > > > > > it
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > > > > > > > >> >> it
> > > > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > > > >> >> > > > > >> thrown at different
> > > location in
> > > > > > the
> > > > > > > > > code).
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> As far as I understand
> the
> > > > > > following
> > > > > > > > > > > statement
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > > > > > > > >> >> > > > > >> is the reason that
> > currently
> > > > > such
> > > > > > a
> > > > > > > > > super
> > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > is not
> > > > > > > > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> "Right now we have
> > > > > > > RetriableException
> > > > > > > > > > type,
> > > > > > > > > > > if
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > `ProducerRetriableException`
> > > > > type,
> > > > > > > we
> > > > > > > > > have
> > > > > > > > > > > to
> > > > > > > > > > > > > put
> > > > > > > > > > > > > > > > this new
> > > > > > > > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > > > >> >> > > > > >> the parent of the
> > > > > > > RetriableException,
> > > > > > > > > > > because
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> > > > > >> exceptions are
> `retriable`
> > > in
> > > > > > > general
> > > > > > > > > for
> > > > > > > > > > > > > > producer"
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> In the list of
> exceptions
> > in
> > > > > your
> > > > > > > > KIP, I
> > > > > > > > > > > found
> > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > > > > > > > RetriableException.
> > > > > > > > > I
> > > > > > > > > > > > guess
> > > > > > > > > > > > > > > those
> > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > > > > > > > >> >> > > > > >> are referring to in your
> > > > > > statement:
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > InvalidProducerEpochException
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> InvalidPidMappingException
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > TransactionAbortedException
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> All of those exceptions
> > are
> > > > > > > non-fatal
> > > > > > > > > and
> > > > > > > > > > do
> > > > > > > > > > > > not
> > > > > > > > > > > > > > > > inherit
> > > > > > > > > > > > > > > > > > >> from
> > > > > > > > > > > > > > > > > > >> >> > > > > >> RetriableException. Is
> > > there a
> > > > > > > reason
> > > > > > > > > for
> > > > > > > > > > > > that?
> > > > > > > > > > > > > If
> > > > > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > > > > > > > >> >> > > > > >> RetriableException we
> > would
> > > be a
> > > > > > bit
> > > > > > > > > > closer
> > > > > > > > > > > > to a
> > > > > > > > > > > > > > > super
> > > > > > > > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> The reason is that
> sender
> > > may
> > > > > > catch
> > > > > > > > > those
> > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it
> > > currently
> > > > > > > does
> > > > > > > > > > > infinite
> > > > > > > > > > > > > > > > > > >> >> > > > > > retries on
> > > RetriableException. To
> > > > > > > make
> > > > > > > > > sure
> > > > > > > > > > > we
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > trigger
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> > > > > > abortTransaction() by
> > > catching
> > > > > > > > non-fatal
> > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > > >> >> > > > > > exceptions and
> reinitialize
> > > the
> > > > > txn
> > > > > > > > > state,
> > > > > > > > > > we
> > > > > > > > > > > > > chose
> > > > > > > > > > > > > > > > not to
> > > > > > > > > > > > > > > > > > >> let
> > > > > > > > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > > > > > > > >> >> > > > > > exceptions inherit
> > > > > > RetriableException
> > > > > > > > so
> > > > > > > > > > that
> > > > > > > > > > > > > > > > > > >> >> > > > > > they won't cause infinite
> > > retry
> > > > > on
> > > > > > > > > sender.
> > > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> With
> > > OutOfOrderSequenceException
> > > > > > and
> > > > > > > > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > > > > >> >> > > > > >> to understand that their
> > > > > fatality
> > > > > > > > > depends
> > > > > > > > > > on
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > type
> > > > > > > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> configuration) of the
> > > producer.
> > > > > > That
> > > > > > > > > makes
> > > > > > > > > > > it
> > > > > > > > > > > > > > > > difficult to
> > > > > > > > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > > > > > > > >> >> > > > > >> exception that encodes
> the
> > > > > > > > retriability
> > > > > > > > > as
> > > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > > above.
> > > > > > > > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > > > > > > > >> >> > > > > >> possible to introduce
> > > exceptions
> > > > > > > that
> > > > > > > > > > > inherit
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > > > > > > > >> >> > > > > >> and that are thrown when
> > > those
> > > > > > > > > exceptions
> > > > > > > > > > > are
> > > > > > > > > > > > > > caught
> > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > > > > > > > >> >> > > > > >> and the type of the
> > > producer is
> > > > > > such
> > > > > > > > > that
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> Yea, I think in general
> > the
> > > > > > > exception
> > > > > > > > > type
> > > > > > > > > > > > > mixing
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > > >> difficult
> > > > > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > > > > > > > >> >> > > > > > them all right. I have
> > > already
> > > > > > > proposed
> > > > > > > > > > > another
> > > > > > > > > > > > > > > > solution
> > > > > > > > > > > > > > > > > > >> based
> > > > > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > > > > > > > >> >> > > > > > offline discussion with
> > some
> > > > > folks
> > > > > > > > > working
> > > > > > > > > > on
> > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > > > >> >> > > > > > to make the handling more
> > > > > > > > straightforward
> > > > > > > > > > for
> > > > > > > > > > > > end
> > > > > > > > > > > > > > > users
> > > > > > > > > > > > > > > > > > >> without
> > > > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > > > > > > > >> >> > > > > > to distinguish exception
> > > > > fatality.
> > > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34,
> > Guozhang
> > > Wang
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the
> > > > > proposal!
> > > > > > I
> > > > > > > > > made a
> > > > > > > > > > > > pass
> > > > > > > > > > > > > > over
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> list
> > > > > > > > > > > > > > > > > > >> >> and
> > > > > > > > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > 0) Although this is
> not
> > > part
> > > > > of
> > > > > > > the
> > > > > > > > > > public
> > > > > > > > > > > > > API,
> > > > > > > > > > > > > > I
> > > > > > > > > > > > > > > > think
> > > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > that the suggested
> > pattern
> > > > > (i.e.
> > > > > > > > user
> > > > > > > > > > can
> > > > > > > > > > > > > always
> > > > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > handling non-fatal
> > > errors) are
> > > > > > > > indeed
> > > > > > > > > > > > > supported.
> > > > > > > > > > > > > > > > E.g. if
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > aborted by the broker
> > > side,
> > > > > then
> > > > > > > > users
> > > > > > > > > > can
> > > > > > > > > > > > > still
> > > > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > not throw another
> > > exception
> > > > > but
> > > > > > > just
> > > > > > > > > be
> > > > > > > > > > > > > treated
> > > > > > > > > > > > > > > as a
> > > > > > > > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > 1)
> > > > > > > > *ConcurrentTransactionsException*:
> > > > > > > > > I
> > > > > > > > > > > > think
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > > > > >> can
> > > > > > > > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > returned but not
> > > documented
> > > > > yet.
> > > > > > > > This
> > > > > > > > > > > should
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > 2)
> > > *InvalidTxnStateException*:
> > > > > > > this
> > > > > > > > > > error
> > > > > > > > > > > is
> > > > > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > state transition
> failed
> > > (e.g.
> > > > > it
> > > > > > > is
> > > > > > > > > > trying
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > transit to
> > > > > > > > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > while the current
> state
> > > is not
> > > > > > > > > > > > > prepare-commit).
> > > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > > > > >> >> could
> > > > > > > > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > a bug on the client
> > > internal
> > > > > > code
> > > > > > > or
> > > > > > > > > the
> > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > code,
> > > > > > > > > > > > > > > > > > OR a
> > > > > > > > > > > > > > > > > > >> >> user
> > > > > > > > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > > > > > > > i.e.
> > > > > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > these exceptions would
> > > only be
> > > > > > > > > returned
> > > > > > > > > > if
> > > > > > > > > > > > > users
> > > > > > > > > > > > > > > > try to
> > > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn
> > > right
> > > > > > after
> > > > > > > a
> > > > > > > > > > > > commitTxn,
> > > > > > > > > > > > > > > etc.
> > > > > > > > > > > > > > > > So
> > > > > > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal
> > > error
> > > > > > > instead
> > > > > > > > > of a
> > > > > > > > > > > > fatal
> > > > > > > > > > > > > > > > error,
> > > > > > > > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*:
> > case
> > > i
> > > > > > > > "indicates
> > > > > > > > > > > fatal
> > > > > > > > > > > > > > > > > > transactional
> > > > > > > > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a
> bit
> > > > > > > conflicting
> > > > > > > > > with
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > is treated as
> > non-fatal. I
> > > > > guess
> > > > > > > > your
> > > > > > > > > > > > proposal
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > OutOfOrderSequenceException
> > > > > > would
> > > > > > > be
> > > > > > > > > > > treated
> > > > > > > > > > > > > > > either
> > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > > >> fatal
> > > > > > > > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > transactional
> producer,
> > or
> > > > > > > non-fatal
> > > > > > > > > > with
> > > > > > > > > > > > > > > idempotent
> > > > > > > > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > right? If the producer
> > is
> > > only
> > > > > > > > > > configured
> > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > > idempotency
> > > > > > > > > > > > > > > > > > >> >> but
> > > > > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > transaction, then
> > > throwing a
> > > > > > > > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would
> > be
> > > > > > > confusing
> > > > > > > > > > since
> > > > > > > > > > > > > users
> > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > > >> using
> > > > > > > > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest
> we
> > > > > always
> > > > > > > > throw
> > > > > > > > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter
> > > how the
> > > > > > > > > producer
> > > > > > > > > > is
> > > > > > > > > > > > > > > > configured,
> > > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > >> >> let
> > > > > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > decide how to handle
> it
> > > based
> > > > > on
> > > > > > > > > whether
> > > > > > > > > > > it
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > only
> > > > > > > > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn
> > > APIs,
> > > > > the
> > > > > > > > > > `send()`
> > > > > > > > > > > > > > > callback /
> > > > > > > > > > > > > > > > > > >> future
> > > > > > > > > > > > > > > > > > >> >> can
> > > > > > > > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > txn-related
> exceptions,
> > I
> > > > > think
> > > > > > > this
> > > > > > > > > KIP
> > > > > > > > > > > > > should
> > > > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > > cover
> > > > > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > 5) This is related to
> > 1/2)
> > > > > > above:
> > > > > > > > > > > sometimes
> > > > > > > > > > > > > > those
> > > > > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or
> > > > > InvalidTxnState
> > > > > > > are
> > > > > > > > > not
> > > > > > > > > > > due
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > state
> > > > > > > > > > > > > > > > > > >> >> being
> > > > > > > > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > at the broker side,
> but
> > > maybe
> > > > > > > users
> > > > > > > > > are
> > > > > > > > > > > > doing
> > > > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > wondering if we should
> > > further
> > > > > > > > > > distinguish
> > > > > > > > > > > > > those
> > > > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > a) those that are
> caused
> > > by
> > > > > > Kafka
> > > > > > > > > > itself,
> > > > > > > > > > > > > e.g. a
> > > > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > > >> timed
> > > > > > > > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > aborted a txn and
> later
> > an
> > > > > > endTxn
> > > > > > > > > > request
> > > > > > > > > > > is
> > > > > > > > > > > > > > > > received,
> > > > > > > > > > > > > > > > > > >> and b)
> > > > > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > API call pattern is
> > > incorrect,
> > > > > > > > causing
> > > > > > > > > > the
> > > > > > > > > > > > > > request
> > > > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > error code from the
> > > broker.
> > > > > > > > > > > > > > > > > > >> >>
> *TransactionStateCorruptedException*
> > > > > > > > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > me more like for case
> > a),
> > > but
> > > > > > not
> > > > > > > > case
> > > > > > > > > > b).
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at
> > > 4:50 PM
> > > > > > > > Boyang
> > > > > > > > > > > Chen <
> > > > > > > > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com
> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I
> > think
> > > your
> > > > > > > > > proposal
> > > > > > > > > > > > makes
> > > > > > > > > > > > > > > sense
> > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > > >> well,
> > > > > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> we could have a
> > > universally
> > > > > > > agreed
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > type
> > > > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> without having an
> extra
> > > layer
> > > > > > on
> > > > > > > > top
> > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> some issue on
> > downsides:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> 1. The exception
> > > hierarchy
> > > > > will
> > > > > > > be
> > > > > > > > > more
> > > > > > > > > > > > > > complex.
> > > > > > > > > > > > > > > > Right
> > > > > > > > > > > > > > > > > > >> now
> > > > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> RetriableException
> > type,
> > > if
> > > > > we
> > > > > > > are
> > > > > > > > > > going
> > > > > > > > > > > to
> > > > > > > > > > > > > > add a
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > `ProducerRetriableException`
> > > > > > > type,
> > > > > > > > we
> > > > > > > > > > > have
> > > > > > > > > > > > to
> > > > > > > > > > > > > > put
> > > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > > >> new
> > > > > > > > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> parent of the
> > > > > > RetriableException,
> > > > > > > > > > because
> > > > > > > > > > > > not
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> exceptions are
> > > `retriable` in
> > > > > > > > general
> > > > > > > > > > for
> > > > > > > > > > > > > > > > producer, for
> > > > > > > > > > > > > > > > > > >> >> example
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> We could have an
> empty
> > > > > > interface
> > > > > > > > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions
> > > > > implement
> > > > > > > for
> > > > > > > > > > sure,
> > > > > > > > > > > > but
> > > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > > bit
> > > > > > > > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> way we deal with
> > > exceptions
> > > > > in
> > > > > > > > > general.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases
> > where
> > > we
> > > > > > > throw a
> > > > > > > > > > > > > > > KafkaException
> > > > > > > > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> KafkaException as
> > either
> > > > > fatal
> > > > > > or
> > > > > > > > > > > > non-fatal.
> > > > > > > > > > > > > If
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also
> > > required
> > > > > > to
> > > > > > > > > > > implement
> > > > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > > > >> bloated
> > > > > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> which could replace
> > > > > > > KafkaException
> > > > > > > > > > type,
> > > > > > > > > > > as
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > couldn't
> > > > > > > > > > > > > > > > > > >> mark
> > > > > > > > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> as retriable for
> sure.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the
> > > > > > encapsulation,
> > > > > > > > > > > wrapping
> > > > > > > > > > > > > > means
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> of affection to the
> > > producer
> > > > > > > only,
> > > > > > > > > > which
> > > > > > > > > > > is
> > > > > > > > > > > > > > > > important
> > > > > > > > > > > > > > > > > > >> since
> > > > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> shared exception
> types
> > to
> > > > > > > > implement a
> > > > > > > > > > > > > > > > producer-related
> > > > > > > > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> as
> > > > > > > > UnknownTopicOrPartitionException.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020
> at
> > > 3:38
> > > > > PM
> > > > > > > > > Matthias
> > > > > > > > > > > J.
> > > > > > > > > > > > > Sax
> > > > > > > > > > > > > > <
> > > > > > > > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP
> > > Boyang!
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> Overall,
> categorizing
> > > > > > exceptions
> > > > > > > > > > makes a
> > > > > > > > > > > > lot
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > sense.
> > > > > > > > > > > > > > > > > > >> As I
> > > > > > > > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> the producer
> internals
> > > well
> > > > > > > > enough,
> > > > > > > > > I
> > > > > > > > > > > > cannot
> > > > > > > > > > > > > > > > comment
> > > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> categorization in
> > detail
> > > > > > though.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering
> > is,
> > > if
> > > > > we
> > > > > > > > should
> > > > > > > > > > > > > introduce
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > > >> exception
> > > > > > > > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal
> > exception
> > > > > would
> > > > > > > > > > implement
> > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> that will wrap
> > non-fatal
> > > > > > > > exceptions?
> > > > > > > > > > > What
> > > > > > > > > > > > > > would
> > > > > > > > > > > > > > > > be the
> > > > > > > > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM,
> > > Boyang
> > > > > > Chen
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> I would like to
> > start a
> > > > > > > > discussion
> > > > > > > > > > > thread
> > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming
> to
> > > > > simplify
> > > > > > > the
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > > > > >> logic
> > > > > > > > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> transactional
> > Producer
> > > > > users
> > > > > > by
> > > > > > > > > > > > classifying
> > > > > > > > > > > > > > > > fatal and
> > > > > > > > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> and throw them
> > > > > > correspondingly
> > > > > > > > for
> > > > > > > > > > > easier
> > > > > > > > > > > > > > catch
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > >> retry.
> > > > > > > > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >> > --
> > > > > > > > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > -- Guozhang
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Boyang, sounds good to me.

-Jason

On Mon, Feb 1, 2021 at 9:32 AM Boyang Chen <re...@gmail.com>
wrote:

> Hey Jason and Chia-Ping,
>
> looking at the discussion thread for KIP-706, we are still in discussion
> with the final API version, so I made the decision to decouple the two KIPs
> so that this one could move forward. To be specific, we are still
> introducing ProduceFailed error but as a `near-future` improvement, which
> should be implemented right after KIP-706 is finalized.
>
> Let me know your thoughts.
>
> Best,
> Boyang
>
> On Sat, Jan 30, 2021 at 11:57 AM Chia-Ping Tsai <ch...@apache.org>
> wrote:
>
> > > a richer return type. Let me know if you are good with this, and
> whether
> > > Chia-Ping is also happy here :)
> >
> > sure. I'd like to join this bigger party :)
> >
> > On 2021/01/30 01:03:57, Boyang Chen <re...@gmail.com> wrote:
> > > This is a great proposal Jason, I already integrated KIP-691 with
> KIP-706
> > > template to provide a new ProduceFailedException as well as a new
> > > Producer#produce API with CompletionStage as old send API replacement
> > with
> > > a richer return type. Let me know if you are good with this, and
> whether
> > > Chia-Ping is also happy here :)
> > >
> > > Best,
> > > Boyang
> > >
> > > On Fri, Jan 29, 2021 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> > >
> > > > Hi Boyang,
> > > >
> > > > Ok, if we are going to ask all users to update their code anyway,
> > maybe we
> > > > can get more bang out of this.
> > > >
> > > > One of the remaining problems is that a user will only get the richer
> > > > information if they are supplying a Callback. If they wait on the
> > future,
> > > > then they will get the usual exception. Our hands are somewhat tied
> > here
> > > > because of compatibility. One thing we have wanted to do for a long
> > time is
> > > > introduce a new send method which returns
> > > > `CompletableFuture<RecordMetadata>` instead. If we did so, then we
> > wouldn't
> > > > need to change `Callback` at all. Instead we could make the contract
> > of the
> > > > new api that it will always return `SendFailedException`, which could
> > > > provide a method to get the failure type.
> > > >
> > > > I think Chia-Ping has been thinking about this:
> > > > https://issues.apache.org/jira/browse/KAFKA-12227. Is there a way to
> > > > consolidate our efforts in order to avoid unnecessary noise for
> users?
> > It
> > > > would be awesome to see a more modern API which also addresses the
> > > > confusing error handling.
> > > >
> > > > -Jason
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > On Fri, Jan 29, 2021 at 3:48 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Jason, my hope is that we could streamline the callback
> > function
> > > > > signature, so that in the next major release (3.0),
> > > > > onCompletion(RecordMetadata metadata, SendFailure sendFailure) will
> > > > become
> > > > > the only API to be implemented.
> > > > >
> > > > > On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Boyang,
> > > > > >
> > > > > > Do you think it's necessary to deprecate the old `onCompletion`
> > > > > callback? I
> > > > > > was thinking there's probably no harm leaving it around. Users
> > might
> > > > not
> > > > > > care about the failure type. Other than that, it looks good to
> me.
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Guozhang,
> > > > > > >
> > > > > > > the CommitFailedException wrapping would cover all the
> non-fatal
> > > > > > exceptions
> > > > > > > as we listed out in the KIP, generally speaking any exception
> > that
> > > > > could
> > > > > > > recover safely by calling abortTxn should be wrapped.
> > > > > > >
> > > > > > > Best,
> > > > > > > Boyang
> > > > > > >
> > > > > > > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <
> > wangguoz@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Boyang,
> > > > > > > >
> > > > > > > > I think maybe there's something cracking here :) I'm just
> > asking
> > > > for
> > > > > > > > clarifications that as of today, which non-fatal exceptions
> the
> > > > newly
> > > > > > > > introduced CommitFailedException would cover, and it seems to
> > be
> > > > only
> > > > > > 1)
> > > > > > > > unknown pid, 2) invalid pid mapping, and 3) concurrent
> > > > transactions.
> > > > > Is
> > > > > > > > that correct?
> > > > > > > >
> > > > > > > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> > > > > > reluctanthero104@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Guozhang, I think TimeoutException would not be covered
> > here
> > > > as
> > > > > > it
> > > > > > > > > potentially has a risk of hitting an illegal state on the
> > broker
> > > > > side
> > > > > > > > when
> > > > > > > > > the previous commit was actually successful. Users should
> > try to
> > > > > > > increase
> > > > > > > > > their max.block.ms to avoid hitting the timeout as a base
> > > > > > suggestion,
> > > > > > > > > which
> > > > > > > > > is discussed in the KIP.
> > > > > > > > >
> > > > > > > > > As for the umbrella exception, I agree it has values to
> some
> > > > > extent,
> > > > > > > but
> > > > > > > > in
> > > > > > > > > terms of adoption difficulty, additional exception types
> > usually
> > > > > make
> > > > > > > the
> > > > > > > > > error handling more complex than simplified, and we are
> > doing our
> > > > > > best
> > > > > > > to
> > > > > > > > > avoid compatibility issues caused by wrapping previously
> > thrown
> > > > raw
> > > > > > > > > exceptions that invalidates user error handling. And I'm
> > also not
> > > > > in
> > > > > > > > favor
> > > > > > > > > of our exception hierarchy today where all exceptions are
> > > > > subclasses
> > > > > > of
> > > > > > > > > KafkaException, and we already mixed its handling with
> other
> > > > > specific
> > > > > > > > > exception types that could be caught as superclass
> > > > KafkaException.
> > > > > > That
> > > > > > > > > could be addressed in a separate proposal if we see
> > necessary.
> > > > > > > > >
> > > > > > > > > Boyang
> > > > > > > > >
> > > > > > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <
> > > > wangguoz@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Jason for the suggestion, that looks good to me
> too.
> > > > > > > > > >
> > > > > > > > > > Regarding the non-fatal exceptions wrapped as
> > CommitFailed, I
> > > > > would
> > > > > > > > like
> > > > > > > > > to
> > > > > > > > > > clarify if we would cover all the following cases: 1)
> > timeout,
> > > > 2)
> > > > > > > > unknown
> > > > > > > > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > > > > > > > >
> > > > > > > > > > BTW I think it still makes sense to use an umbrella
> > exception
> > > > in
> > > > > > case
> > > > > > > > in
> > > > > > > > > > the future we add more non-fatal cases even though today
> we
> > > > only
> > > > > > > have a
> > > > > > > > > > few.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Jason, I agree with the proposed solution here,
> > will
> > > > > > update
> > > > > > > > the
> > > > > > > > > > KIP.
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > > > > > > jason@confluent.io
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Boyang,
> > > > > > > > > > > >
> > > > > > > > > > > > It seems like a reasonable suggestion. I wonder if a
> > flag
> > > > is
> > > > > > > > > sufficient
> > > > > > > > > > > > though. The current `Callback` documentation treats
> > "fatal"
> > > > > > > errors
> > > > > > > > > from
> > > > > > > > > > > the
> > > > > > > > > > > > perspective of the individual message that was sent.
> > > > > > > > > > > >
> > > > > > > > > > > > ```
> > > > > > > > > > > >     *                  Non-Retriable exceptions
> > (fatal, the
> > > > > > > message
> > > > > > > > > > will
> > > > > > > > > > > > never be sent):
> > > > > > > > > > > > ```
> > > > > > > > > > > >
> > > > > > > > > > > > However, we also have fatal errors from the
> > perspective of
> > > > > the
> > > > > > > > > > > transaction
> > > > > > > > > > > > (e.g. when the producer gets fenced). Perhaps that
> > suggests
> > > > > we
> > > > > > > need
> > > > > > > > > > > > something richer than a boolean flag:
> > > > > > > > > > > >
> > > > > > > > > > > > At a high level, I think the following cases are
> > possible:
> > > > > > > > > > > >
> > > > > > > > > > > > - message rejected (e.g. message too large, invalid
> > topic)
> > > > > > > > > > > > - delivery failed after retries/delivery timeout
> (e.g.
> > > > > timeout,
> > > > > > > crc
> > > > > > > > > > > error,
> > > > > > > > > > > > not enough replicas)
> > > > > > > > > > > > - transaction failed (e.g. producer fenced, invalid
> > > > > transaction
> > > > > > > > > state)
> > > > > > > > > > > >
> > > > > > > > > > > > Perhaps instead we can have a type like the
> following:
> > > > > > > > > > > >
> > > > > > > > > > > > class SendFailure {
> > > > > > > > > > > >   FailureType failureType;
> > > > > > > > > > > >   Exception cause;
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > enum FailureType {
> > > > > > > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED,
> > TRANSACTION_FAILED
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > (Not married to any of these names, just a starting
> > point.)
> > > > > > > > > > > >
> > > > > > > > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > > > > > > > >
> > > > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > > > SendFailure
> > > > > > > > > > failure) {
> > > > > > > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > This would give streams and other applications enough
> > > > > > information
> > > > > > > > to
> > > > > > > > > > know
> > > > > > > > > > > > whether the message can be retried and whether the
> > > > > transaction
> > > > > > > can
> > > > > > > > be
> > > > > > > > > > > > aborted.
> > > > > > > > > > > >
> > > > > > > > > > > > What do you think?
> > > > > > > > > > > >
> > > > > > > > > > > > -Jason
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Jason for the thoughts.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > > > > > > > jason@confluent.io
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Boyang,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the iterations here. I think this is
> > > > something
> > > > > > we
> > > > > > > > > should
> > > > > > > > > > > > have
> > > > > > > > > > > > > > done a long time ago. It sounds like there are
> two
> > API
> > > > > > > changes
> > > > > > > > > > here:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. We are introducing the `CommitFailedException`
> > to
> > > > wrap
> > > > > > > > > abortable
> > > > > > > > > > > > > > errors that are raised from `commitTransaction`.
> > This
> > > > > > sounds
> > > > > > > > fine
> > > > > > > > > > to
> > > > > > > > > > > > me.
> > > > > > > > > > > > > As
> > > > > > > > > > > > > > far as I know, the only case we might need this
> is
> > when
> > > > > we
> > > > > > > add
> > > > > > > > > > > support
> > > > > > > > > > > > to
> > > > > > > > > > > > > > let producers recover from coordinator timeouts.
> > Are
> > > > > there
> > > > > > > any
> > > > > > > > > > > others?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think the purpose here is to ensure non-fatal
> > > > > exceptions
> > > > > > > are
> > > > > > > > > > > unified
> > > > > > > > > > > > > under the same
> > > > > > > > > > > > > exception umbrella, to make the proceeding to abort
> > any
> > > > > > ongoing
> > > > > > > > > > > > transaction
> > > > > > > > > > > > > much easier.
> > > > > > > > > > > > > I don't think `coordinator timeouts` is the only
> > case to
> > > > > > > recover
> > > > > > > > > > here,
> > > > > > > > > > > > > since we have other
> > > > > > > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. We are wrapping non-fatal errors raised from
> > `send` as
> > > > > > > > > > > > `KafkaException`.
> > > > > > > > > > > > > > The motivation for this is less clear to me and
> it
> > > > > doesn't
> > > > > > > look
> > > > > > > > > > like
> > > > > > > > > > > > the
> > > > > > > > > > > > > > example from the KIP depends on it. My concern
> > here is
> > > > > > > > > > compatibility.
> > > > > > > > > > > > > > Currently we have the following documentation for
> > the
> > > > > > > > `Callback`
> > > > > > > > > > API:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > ```
> > > > > > > > > > > > > >      *                  Non-Retriable exceptions
> > > > (fatal,
> > > > > > the
> > > > > > > > > > message
> > > > > > > > > > > > will
> > > > > > > > > > > > > > never be sent):
> > > > > > > > > > > > > >      *
> > > > > > > > > > > > > >      *                  InvalidTopicException
> > > > > > > > > > > > > >      *
> > OffsetMetadataTooLargeException
> > > > > > > > > > > > > >      *
> > RecordBatchTooLargeException
> > > > > > > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > > > > > > >      *                  UnknownServerException
> > > > > > > > > > > > > >      *
> UnknownProducerIdException
> > > > > > > > > > > > > >      *
> > InvalidProducerEpochException
> > > > > > > > > > > > > >      *
> > > > > > > > > > > > > >      *                  Retriable exceptions
> > > > (transient,
> > > > > > may
> > > > > > > be
> > > > > > > > > > > covered
> > > > > > > > > > > > > by
> > > > > > > > > > > > > > increasing #.retries):
> > > > > > > > > > > > > >      *
> > > > > > > > > > > > > >      *                  CorruptRecordException
> > > > > > > > > > > > > >      *                  InvalidMetadataException
> > > > > > > > > > > > > >      *
> > > > > > NotEnoughReplicasAfterAppendException
> > > > > > > > > > > > > >      *
> NotEnoughReplicasException
> > > > > > > > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > > > > > > > >      *                  TimeoutException
> > > > > > > > > > > > > >      *
> > > > UnknownTopicOrPartitionException
> > > > > > > > > > > > > > ```
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If we wrap all the retriable exceptions
> documented
> > here
> > > > > as
> > > > > > > > > > > > > > `KafkaException`, wouldn't that break any error
> > > > handling
> > > > > > that
> > > > > > > > > users
> > > > > > > > > > > > might
> > > > > > > > > > > > > > already have? it's gonna introduce a
> compatibility
> > > > issue.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The original intention was to simplify `send`
> > callback
> > > > > > error
> > > > > > > > > > handling
> > > > > > > > > > > > by
> > > > > > > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > > > > > > we have to prepare an exhausting list of exceptions
> > to
> > > > > catch
> > > > > > as
> > > > > > > > > > fatal,
> > > > > > > > > > > > and
> > > > > > > > > > > > > the same lengthy list to catch as
> > > > > > > > > > > > > non-fatal. It would be much easier if we got
> `hints`
> > from
> > > > > the
> > > > > > > > > > callback.
> > > > > > > > > > > > > However,
> > > > > > > > > > > > > I agree there is a compatibility concern, what
> about
> > > > > > > deprecating
> > > > > > > > > the
> > > > > > > > > > > > > existing:
> > > > > > > > > > > > >
> > > > > > > > > > > > > void onCompletion(RecordMetadata metadata,
> Exception
> > > > > > exception)
> > > > > > > > > > > > >
> > > > > > > > > > > > > and replace it with:
> > > > > > > > > > > > >
> > > > > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > > > Exception
> > > > > > > > > > exception,
> > > > > > > > > > > > > boolean isFatal) {
> > > > > > > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > > > to make sure new users get the benefit of
> > understanding
> > > > the
> > > > > > > > > fatality
> > > > > > > > > > > > based
> > > > > > > > > > > > > on the info presented by the producer?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > > > > > > info@hiringuru.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Why  we are receiving all emails kindly remove
> us
> > > > from
> > > > > > > > > > > > > > > dev@kafka.apache.org we don't want to receive
> > emails
> > > > > > > > anymore.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> > > > > > wangguoz@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks Boyang, yes I think I was confused
> > about the
> > > > > > > > different
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > two abortTxn calls, and now I get it was not
> > > > > > > intentional. I
> > > > > > > > > > > think I
> > > > > > > > > > > > > do
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > have more concerns.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks for the clarification Guozhang, I
> got
> > your
> > > > > > point
> > > > > > > > > that
> > > > > > > > > > we
> > > > > > > > > > > > > want
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > have a consistent handling of fatal
> > exceptions
> > > > > being
> > > > > > > > thrown
> > > > > > > > > > > from
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > abortTxn. I modified the current template
> to
> > move
> > > > > the
> > > > > > > > fatal
> > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > > try-catch outside of the processing loop to
> > make
> > > > > sure
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > > get
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > chance
> > > > > > > > > > > > > > > > > to close consumer/producer modules. Let me
> > know
> > > > > what
> > > > > > > you
> > > > > > > > > > think.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang
> Chen
> > <
> > > > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > My understanding is that abortTransaction
> > would
> > > > > > only
> > > > > > > > > throw
> > > > > > > > > > > when
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > producer is in fatal state. For
> > CommitFailed,
> > > > the
> > > > > > > > > producer
> > > > > > > > > > > > should
> > > > > > > > > > > > > > > still
> > > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > > in the abortable error state, so that
> > > > > > > abortTransaction
> > > > > > > > > call
> > > > > > > > > > > > would
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > throw.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang
> > Wang
> > > > <
> > > > > > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >> Or are you going to maintain some
> internal
> > > > state
> > > > > > > such
> > > > > > > > > > that,
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> `abortTransaction` in the catch block
> > would
> > > > > never
> > > > > > > > throw
> > > > > > > > > > > again?
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM
> Guozhang
> > > > Wang <
> > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > I've also thought about this (i.e.
> using
> > > > > > > > CommitFailed
> > > > > > > > > > for
> > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > > > > > > >> > but what I'm pondering is that, in the
> > catch
> > > > > > > > > > > (CommitFailed)
> > > > > > > > > > > > > > block,
> > > > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > >> > would happen if the
> > > > > > `producer.abortTransaction();`
> > > > > > > > > > throws
> > > > > > > > > > > > > again?
> > > > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > > >> > that be captured as a fatal and cause
> > the
> > > > > client
> > > > > > > to
> > > > > > > > > > close
> > > > > > > > > > > > > again.
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > If yes, then naively the pattern would
> > be:
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > > > > > > >> >         // Transaction commit failed
> > with
> > > > > > > abortable
> > > > > > > > > > error,
> > > > > > > > > > > > > user
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > >> reset
> > > > > > > > > > > > > > > > > >> >         // the application state and
> > resume
> > > > > > with a
> > > > > > > > new
> > > > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > >> > root
> > > > > > > > > > > > > > > > > >> >         // cause was wrapped in the
> > thrown
> > > > > > > > exception.
> > > > > > > > > > > > > > > > > >> >
> > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > > > > > > >> >
>  producer.abortTransaction();
> > > > > > > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > > > > > > >> >         }
> > > > > > > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > > > > > > >> >     }
> > > > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM
> Boyang
> > > > Chen <
> > > > > > > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> Jason and I were discussing the new
> API
> > > > > offline
> > > > > > > and
> > > > > > > > > > > decided
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > take
> > > > > > > > > > > > > > > > > >> >> another
> > > > > > > > > > > > > > > > > >> >> approach. Firstly, the reason not to
> > > > invent a
> > > > > > new
> > > > > > > > API
> > > > > > > > > > > with
> > > > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > > > > > > >> >> flag is for compatibility
> > consideration,
> > > > > since
> > > > > > > old
> > > > > > > > > EOS
> > > > > > > > > > > code
> > > > > > > > > > > > > > > would not
> > > > > > > > > > > > > > > > > >> know
> > > > > > > > > > > > > > > > > >> >> that a given transaction commit was
> > failed
> > > > > > > > internally
> > > > > > > > > > as
> > > > > > > > > > > > they
> > > > > > > > > > > > > > > don't
> > > > > > > > > > > > > > > > > >> listen
> > > > > > > > > > > > > > > > > >> >> to the output flag. Our proposed
> > > > alternative
> > > > > > > > solution
> > > > > > > > > > is
> > > > > > > > > > > to
> > > > > > > > > > > > > let
> > > > > > > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > > > > > > >> >> throw CommitFailedException whenever
> > the
> > > > > commit
> > > > > > > > > failed
> > > > > > > > > > > with
> > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > >> >> exception*, so that on the caller
> side
> > the
> > > > > > > handling
> > > > > > > > > > logic
> > > > > > > > > > > > > > > becomes:
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > > > > > > >> >>
> >  producer.commitTransaction();
> > > > > > > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > > >> >>
> >  producer.abortTransaction();
> > > > > > > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > > > > > > >> >>     } catch (CommitFailedException
> e) {
> > > > > > > > > > > > > > > > > >> >>         // Transaction commit failed
> > with
> > > > > > > abortable
> > > > > > > > > > > error,
> > > > > > > > > > > > > user
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > > > > > > >> >>         // the application state and
> > resume
> > > > > > with
> > > > > > > a
> > > > > > > > > new
> > > > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > >> >> root
> > > > > > > > > > > > > > > > > >> >>         // cause was wrapped in the
> > thrown
> > > > > > > > exception.
> > > > > > > > > > > > > > > > > >> >>
> > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> This approach looks cleaner as all
> > > > exception
> > > > > > > types
> > > > > > > > > > other
> > > > > > > > > > > > than
> > > > > > > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > > > > > > >> >> will doom to be fatal, which is very
> > easy
> > > > to
> > > > > > > adopt
> > > > > > > > > for
> > > > > > > > > > > > users.
> > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> >> meantime, we still maintain the
> > commitTxn
> > > > > > > behavior
> > > > > > > > to
> > > > > > > > > > > throw
> > > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> In addition, we decided to drop the
> > > > > > > recommendation
> > > > > > > > to
> > > > > > > > > > > > handle
> > > > > > > > > > > > > > > > > >> >> TimeoutException and leave it to the
> > users
> > > > to
> > > > > > > make
> > > > > > > > > the
> > > > > > > > > > > > call.
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > > >> downside
> > > > > > > > > > > > > > > > > >> >> for blindly calling abortTxn upon
> > timeout
> > > > is
> > > > > > that
> > > > > > > > we
> > > > > > > > > > > could
> > > > > > > > > > > > > > > result in
> > > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > >> >> illegal state when the commit was
> > already
> > > > > > > > successful
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > >> >> side. Without a good guarantee on the
> > > > > outcome,
> > > > > > > > > > > > > overcomplicating
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> >> template should not be encouraged
> IMHO.
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> Let me know your thoughts on the new
> > > > approach
> > > > > > > here,
> > > > > > > > > > thank
> > > > > > > > > > > > > you!
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM
> > Guozhang
> > > > > Wang
> > > > > > <
> > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > Thanks for your clarification on
> > 2)/3),
> > > > > that
> > > > > > > > makes
> > > > > > > > > > > sense.
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM
> > Boyang
> > > > > Chen
> > > > > > <
> > > > > > > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >> > > Thanks for the input Guozhang,
> > replied
> > > > > > > inline.
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM
> > > > Guozhang
> > > > > > > Wang <
> > > > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I
> > read it
> > > > > > again
> > > > > > > > and
> > > > > > > > > > > have
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> following
> > > > > > > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if
> > > > > commitTxn
> > > > > > > does
> > > > > > > > > not
> > > > > > > > > > > > throw
> > > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > > > > > > >> >> > > > exception, and instead we rely
> > on the
> > > > > > > > > subsequent
> > > > > > > > > > > > > beginTxn
> > > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > > > > > > >> >> > > > may violate some callers with a
> > > > pattern
> > > > > > > that
> > > > > > > > > > > relying
> > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > > > > > > >> >> > > > succeed to make some
> non-rollback
> > > > > > > operations.
> > > > > > > > > For
> > > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > > > > > > >> >> > > > // do some read-write on my
> > local DB
> > > > > > > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then
> > commit
> > > > > the
> > > > > > > DB
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > The issue is that, committing
> DB
> > is a
> > > > > > > > > > non-rollback
> > > > > > > > > > > > > > > operation,
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > >> >> users
> > > > > > > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > > > > > > >> >> > > > just rely on commitTxn to
> return
> > > > > without
> > > > > > > > error
> > > > > > > > > to
> > > > > > > > > > > > make
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > > > > > > >> >> > > > call. Of course we can just
> claim
> > > > this
> > > > > > > > pattern
> > > > > > > > > is
> > > > > > > > > > > not
> > > > > > > > > > > > > > > > > legitimate
> > > > > > > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > > > > > > >> >> > > > the right way of doing things,
> > but
> > > > many
> > > > > > > users
> > > > > > > > > may
> > > > > > > > > > > > > > naturally
> > > > > > > > > > > > > > > > > adopt
> > > > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > So maybe we should still let
> > > > commitTxn
> > > > > > also
> > > > > > > > > throw
> > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > > > > > > >> >> > > > which case we would then call
> > > > abortTxn
> > > > > > > again.
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > But if we do this, the pattern
> > > > becomes:
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > > > > > > >> >> > > >     } catch (...) {        //
> > > > enumerate
> > > > > > all
> > > > > > > > > fatal
> > > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > if (!shouldCommit &&
> running()) {
> > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > Thank makes me think, the
> > alternative
> > > > > we
> > > > > > > have
> > > > > > > > > > > > discussed
> > > > > > > > > > > > > > > offline
> > > > > > > > > > > > > > > > > >> may
> > > > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > > > >> >> > > > better: let commitTxn() to
> > return a
> > > > > > boolean
> > > > > > > > > flag.
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > * If it returns true, it means
> > the
> > > > > commit
> > > > > > > > > > > succeeded.
> > > > > > > > > > > > > > Users
> > > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > > > > > > >> >> > > > continue and do any external
> > > > > non-rollback
> > > > > > > > > > > operations
> > > > > > > > > > > > if
> > > > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > > >> like.
> > > > > > > > > > > > > > > > > >> >> > > > * If it returns false, it means
> > the
> > > > > > commit
> > > > > > > > > failed
> > > > > > > > > > > > with
> > > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > >> >> error
> > > > > > > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > > > > > > >> >> > > > the txn has been aborted*.
> Users
> > do
> > > > not
> > > > > > > need
> > > > > > > > to
> > > > > > > > > > > call
> > > > > > > > > > > > > > > abortTxn
> > > > > > > > > > > > > > > > > >> again.
> > > > > > > > > > > > > > > > > >> >> > > > * If it throws, then it means
> > fatal
> > > > > > errors.
> > > > > > > > > Users
> > > > > > > > > > > > > should
> > > > > > > > > > > > > > > shut
> > > > > > > > > > > > > > > > > >> down
> > > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > In this case, the pattern
> > becomes:
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > > > > > > >> >> > > >         commitSucceeded =
> > commitTxn()
> > > > > > > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > > > > > > >> >> > > >         // reset offsets,
> > rollback
> > > > > > > > operations,
> > > > > > > > > > etc
> > > > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > > > > > > >> >> > > >    // do other non-rollbackable
> > > > things
> > > > > > > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > > > > > > >> >> > > >    // reset offsets, rollback
> > > > > operations,
> > > > > > > etc
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > Of course, if we want to have
> > better
> > > > > > > > visibility
> > > > > > > > > > > into
> > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > > caused
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > > > > > > >> >> > > > to fail and txn to abort. We
> can
> > let
> > > > > the
> > > > > > > > return
> > > > > > > > > > > type
> > > > > > > > > > > > be
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > enum
> > > > > > > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > > > >> >> > > > a flag. But my main idea is to
> > still
> > > > > let
> > > > > > > the
> > > > > > > > > > > > commitTxn
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> final
> > > > > > > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > > > > > > >> >> > > > users can tell whether this txn
> > > > > succeeded
> > > > > > > or
> > > > > > > > > not,
> > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > I agree that adding a boolean
> > flag is
> > > > > > > indeed
> > > > > > > > > > useful
> > > > > > > > > > > > in
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > case.
> > > > > > > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the
> > behavior
> > > > to
> > > > > > > throw
> > > > > > > > > > fatal
> > > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > > > >> >> > > > sure what do you mean by
> "throw"
> > > > here.
> > > > > > Are
> > > > > > > > you
> > > > > > > > > > > > > proposing
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal
> > > > > > exceptions
> > > > > > > > when
> > > > > > > > > > > > > triggered
> > > > > > > > > > > > > > > > > without
> > > > > > > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the
> > > > benefit
> > > > > is
> > > > > > > to
> > > > > > > > > make
> > > > > > > > > > > the
> > > > > > > > > > > > > end
> > > > > > > > > > > > > > > > > user's
> > > > > > > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand
> the
> > > > claim
> > > > > > > > > regarding
> > > > > > > > > > > the
> > > > > > > > > > > > > > > callback
> > > > > > > > > > > > > > > > > >> that
> > > > > > > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > > > > > > >> >> > > > setup, it is not required to
> > handle
> > > > the
> > > > > > > > > > exception".
> > > > > > > > > > > > Are
> > > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try
> to
> > > > > handle
> > > > > > > any
> > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > > > > > > >> >> > > > in the callback anymore, but
> > just let
> > > > > > > > > commitTxn()
> > > > > > > > > > > > > itself
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > > > >> >> > > > notified about the problem?
> > > > Personally
> > > > > I
> > > > > > > > think
> > > > > > > > > in
> > > > > > > > > > > > > Streams
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > >> should
> > > > > > > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > > > > > > >> >> > > > make the handling logic of the
> > > > callback
> > > > > > to
> > > > > > > be
> > > > > > > > > > > > > consistent
> > > > > > > > > > > > > > > > > >> regardless
> > > > > > > > > > > > > > > > > >> >> of
> > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > >> >> > > > EOS settings (today we have
> > different
> > > > > > logic
> > > > > > > > > > > depending
> > > > > > > > > > > > > on
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > >> logic,
> > > > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > > > >> >> > > > I think could be unified as
> > well).
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > My idea originates from the
> > claim on
> > > > > send
> > > > > > > > API:
> > > > > > > > > > > > > > > > > >> >> > > "When used as part of a
> > transaction, it
> > > > > is
> > > > > > > not
> > > > > > > > > > > > necessary
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > define
> > > > > > > > > > > > > > > > > >> a
> > > > > > > > > > > > > > > > > >> >> > > callback or check the result of
> the
> > > > > future
> > > > > > > in
> > > > > > > > > > order
> > > > > > > > > > > to
> > > > > > > > > > > > > > > detect
> > > > > > > > > > > > > > > > > >> errors
> > > > > > > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > > > > > > >> >> > > My understanding is that for EOS,
> > the
> > > > > > > exception
> > > > > > > > > > will
> > > > > > > > > > > be
> > > > > > > > > > > > > > > detected
> > > > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > > > > > > >> >> > > transactional APIs either way, so
> > it
> > > > is a
> > > > > > > > > duplicate
> > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > >> track
> > > > > > > > > > > > > > > > > >> >> > > all the sendExceptions in
> > > > > RecordCollector.
> > > > > > > > > > However, I
> > > > > > > > > > > > > > looked
> > > > > > > > > > > > > > > up
> > > > > > > > > > > > > > > > > >> >> > > sendException is being used today
> > as
> > > > > > follow:
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > 1. Pass to
> > "ProductionExceptionHandler"
> > > > > for
> > > > > > > any
> > > > > > > > > > > default
> > > > > > > > > > > > > or
> > > > > > > > > > > > > > > > > >> customized
> > > > > > > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > > > > > > >> >> > > 2. Stop collecting offset info or
> > new
> > > > > > > > exceptions
> > > > > > > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions
> in
> > > > > close(),
> > > > > > > > > flush()
> > > > > > > > > > > or
> > > > > > > > > > > > > new
> > > > > > > > > > > > > > > send()
> > > > > > > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > To my understanding, we should
> > still
> > > > > honor
> > > > > > > the
> > > > > > > > > > > > commitment
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > #1
> > > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > >> >> any
> > > > > > > > > > > > > > > > > >> >> > > user customized implementation.
> > The #2
> > > > > does
> > > > > > > not
> > > > > > > > > > > really
> > > > > > > > > > > > > > > affect our
> > > > > > > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still
> > valuable
> > > > > as
> > > > > > it
> > > > > > > > > could
> > > > > > > > > > > > help
> > > > > > > > > > > > > us
> > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > > > >> fast
> > > > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > > > >> >> > > new send() instead of waiting to
> > later
> > > > > > stage
> > > > > > > of
> > > > > > > > > > > > > processing.
> > > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > > > > > > >> >> > > I agree we should continue to
> > record
> > > > send
> > > > > > > > > > exceptions
> > > > > > > > > > > > even
> > > > > > > > > > > > > > > under
> > > > > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > > >> >> case
> > > > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > > > >> >> > > ensure the strength of stream
> side
> > > > > Producer
> > > > > > > > > logic.
> > > > > > > > > > On
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > safer
> > > > > > > > > > > > > > > > > >> side,
> > > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > > > > > > >> >> > > longer need to wrap certain fatal
> > > > > > exceptions
> > > > > > > > like
> > > > > > > > > > > > > > > ProducerFenced
> > > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > >> >> > > TaskMigrated, since they should
> not
> > > > crash
> > > > > > the
> > > > > > > > > > stream
> > > > > > > > > > > > > thread
> > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > >> thrown
> > > > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > > > >> >> > > raw format, once we adopt the new
> > > > > > processing
> > > > > > > > > model
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > send
> > > > > > > > > > > > > > > > > >> phase.
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM
> > > > Boyang
> > > > > > > Chen <
> > > > > > > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > > Thanks for everyone's
> feedback
> > so
> > > > > far.
> > > > > > I
> > > > > > > > have
> > > > > > > > > > > > > polished
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > KIP
> > > > > > > > > > > > > > > > > >> >> after
> > > > > > > > > > > > > > > > > >> >> > > > > offline discussion with some
> > folks
> > > > > > > working
> > > > > > > > on
> > > > > > > > > > EOS
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > make the
> > > > > > > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > > > > > > >> >> > > > > handling more lightweight.
> The
> > > > > > essential
> > > > > > > > > change
> > > > > > > > > > > is
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > >> not
> > > > > > > > > > > > > > > > > >> >> > > > > inventing a new intermediate
> > > > > exception
> > > > > > > > type,
> > > > > > > > > > but
> > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > 1. The data transmission
> phase
> > > > > > > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > This way for any exception
> > thrown
> > > > > from
> > > > > > > > phase
> > > > > > > > > 1,
> > > > > > > > > > > > will
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > > > > > > >> >> > > > > phase 2 whether we should do
> > commit
> > > > > or
> > > > > > > > abort,
> > > > > > > > > > and
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > now on
> > > > > > > > > > > > > > > > > >> >> > > > > `commitTransaction` should
> only
> > > > throw
> > > > > > > fatal
> > > > > > > > > > > > > exceptions,
> > > > > > > > > > > > > > > > > >> similar to
> > > > > > > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that
> any
> > > > > > > > > KafkaException
> > > > > > > > > > > > caught
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > > > > > > >> >> > > > > will be definitely fatal to
> > crash
> > > > the
> > > > > > > app.
> > > > > > > > > For
> > > > > > > > > > > more
> > > > > > > > > > > > > > > advanced
> > > > > > > > > > > > > > > > > >> users
> > > > > > > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > > >> >> > > > > Streams, we have the ability
> to
> > > > > further
> > > > > > > > wrap
> > > > > > > > > > > > selected
> > > > > > > > > > > > > > > types
> > > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > > > >> >> > > > > exceptions to trigger task
> > > > migration
> > > > > if
> > > > > > > > > > > necessary.
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > More details in the KIP, feel
> > free
> > > > to
> > > > > > > take
> > > > > > > > > > > another
> > > > > > > > > > > > > > look,
> > > > > > > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36
> PM
> > > > > Boyang
> > > > > > > > Chen <
> > > > > > > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the
> > feedback.
> > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26
> > AM
> > > > > Bruno
> > > > > > > > > Cadonna
> > > > > > > > > > <
> > > > > > > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also
> not
> > > > know
> > > > > > the
> > > > > > > > > > producer
> > > > > > > > > > > > > > > internal
> > > > > > > > > > > > > > > > > well
> > > > > > > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > > > > > > >> >> > > > > >> comment on the
> > categorization.
> > > > > > > However,
> > > > > > > > I
> > > > > > > > > > > think
> > > > > > > > > > > > > > > having a
> > > > > > > > > > > > > > > > > >> super
> > > > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException)
> > that
> > > > > > > encodes
> > > > > > > > if
> > > > > > > > > > an
> > > > > > > > > > > > > > > exception is
> > > > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > > > > > > >> >> > > > > >> is cleaner, better
> > > > understandable
> > > > > > and
> > > > > > > > less
> > > > > > > > > > > > > > > error-prone,
> > > > > > > > > > > > > > > > > >> because
> > > > > > > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > > > > > > >> >> > > > > >> when you add a new
> non-fatal
> > > > > > exception
> > > > > > > > in
> > > > > > > > > > > future
> > > > > > > > > > > > > you
> > > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > > >> need
> > > > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > > > >> >> > > > > >> about letting it inherit
> > from
> > > > the
> > > > > > > super
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > > > >> >> > > > > >> the code will just behave
> > > > > correctly
> > > > > > > > > without
> > > > > > > > > > > the
> > > > > > > > > > > > > need
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > wrap
> > > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > > > > > > >> >> > > > > >> exception into another
> > exception
> > > > > > each
> > > > > > > > time
> > > > > > > > > > it
> > > > > > > > > > > is
> > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > > > > > > >> >> it
> > > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > > >> >> > > > > >> thrown at different
> > location in
> > > > > the
> > > > > > > > code).
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> As far as I understand the
> > > > > following
> > > > > > > > > > statement
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > > > > > > >> >> > > > > >> is the reason that
> currently
> > > > such
> > > > > a
> > > > > > > > super
> > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > is not
> > > > > > > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> "Right now we have
> > > > > > RetriableException
> > > > > > > > > type,
> > > > > > > > > > if
> > > > > > > > > > > > we
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> `ProducerRetriableException`
> > > > type,
> > > > > > we
> > > > > > > > have
> > > > > > > > > > to
> > > > > > > > > > > > put
> > > > > > > > > > > > > > > this new
> > > > > > > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > > >> >> > > > > >> the parent of the
> > > > > > RetriableException,
> > > > > > > > > > because
> > > > > > > > > > > > not
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > > > >> >> > > > > >> exceptions are `retriable`
> > in
> > > > > > general
> > > > > > > > for
> > > > > > > > > > > > > producer"
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> In the list of exceptions
> in
> > > > your
> > > > > > > KIP, I
> > > > > > > > > > found
> > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > > > > > > RetriableException.
> > > > > > > > I
> > > > > > > > > > > guess
> > > > > > > > > > > > > > those
> > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > > > > > > >> >> > > > > >> are referring to in your
> > > > > statement:
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > InvalidProducerEpochException
> > > > > > > > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> TransactionAbortedException
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> All of those exceptions
> are
> > > > > > non-fatal
> > > > > > > > and
> > > > > > > > > do
> > > > > > > > > > > not
> > > > > > > > > > > > > > > inherit
> > > > > > > > > > > > > > > > > >> from
> > > > > > > > > > > > > > > > > >> >> > > > > >> RetriableException. Is
> > there a
> > > > > > reason
> > > > > > > > for
> > > > > > > > > > > that?
> > > > > > > > > > > > If
> > > > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > > > > > > >> >> > > > > >> RetriableException we
> would
> > be a
> > > > > bit
> > > > > > > > > closer
> > > > > > > > > > > to a
> > > > > > > > > > > > > > super
> > > > > > > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> The reason is that sender
> > may
> > > > > catch
> > > > > > > > those
> > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it
> > currently
> > > > > > does
> > > > > > > > > > infinite
> > > > > > > > > > > > > > > > > >> >> > > > > > retries on
> > RetriableException. To
> > > > > > make
> > > > > > > > sure
> > > > > > > > > > we
> > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > trigger
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> > > > > > abortTransaction() by
> > catching
> > > > > > > non-fatal
> > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > >> >> > > > > > exceptions and reinitialize
> > the
> > > > txn
> > > > > > > > state,
> > > > > > > > > we
> > > > > > > > > > > > chose
> > > > > > > > > > > > > > > not to
> > > > > > > > > > > > > > > > > >> let
> > > > > > > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > > > > > > >> >> > > > > > exceptions inherit
> > > > > RetriableException
> > > > > > > so
> > > > > > > > > that
> > > > > > > > > > > > > > > > > >> >> > > > > > they won't cause infinite
> > retry
> > > > on
> > > > > > > > sender.
> > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > >> With
> > OutOfOrderSequenceException
> > > > > and
> > > > > > > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > > > >> >> > > > > >> to understand that their
> > > > fatality
> > > > > > > > depends
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > > > > type
> > > > > > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > > > > > >> >> > > > > >> configuration) of the
> > producer.
> > > > > That
> > > > > > > > makes
> > > > > > > > > > it
> > > > > > > > > > > > > > > difficult to
> > > > > > > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > > > > > > >> >> > > > > >> exception that encodes the
> > > > > > > retriability
> > > > > > > > as
> > > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > > above.
> > > > > > > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > > > > > > >> >> > > > > >> possible to introduce
> > exceptions
> > > > > > that
> > > > > > > > > > inherit
> > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > > > > > > >> >> > > > > >> and that are thrown when
> > those
> > > > > > > > exceptions
> > > > > > > > > > are
> > > > > > > > > > > > > caught
> > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > > > > > > >> >> > > > > >> and the type of the
> > producer is
> > > > > such
> > > > > > > > that
> > > > > > > > > > the
> > > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> Yea, I think in general
> the
> > > > > > exception
> > > > > > > > type
> > > > > > > > > > > > mixing
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > > >> difficult
> > > > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > > > > > > >> >> > > > > > them all right. I have
> > already
> > > > > > proposed
> > > > > > > > > > another
> > > > > > > > > > > > > > > solution
> > > > > > > > > > > > > > > > > >> based
> > > > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > > > > > > >> >> > > > > > offline discussion with
> some
> > > > folks
> > > > > > > > working
> > > > > > > > > on
> > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > > >> >> > > > > > to make the handling more
> > > > > > > straightforward
> > > > > > > > > for
> > > > > > > > > > > end
> > > > > > > > > > > > > > users
> > > > > > > > > > > > > > > > > >> without
> > > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > > > > > > >> >> > > > > > to distinguish exception
> > > > fatality.
> > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34,
> Guozhang
> > Wang
> > > > > > > wrote:
> > > > > > > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the
> > > > proposal!
> > > > > I
> > > > > > > > made a
> > > > > > > > > > > pass
> > > > > > > > > > > > > over
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> list
> > > > > > > > > > > > > > > > > >> >> and
> > > > > > > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not
> > part
> > > > of
> > > > > > the
> > > > > > > > > public
> > > > > > > > > > > > API,
> > > > > > > > > > > > > I
> > > > > > > > > > > > > > > think
> > > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > > > > > > >> >> > > > > >> > that the suggested
> pattern
> > > > (i.e.
> > > > > > > user
> > > > > > > > > can
> > > > > > > > > > > > always
> > > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > > > >> >> > > > > >> > handling non-fatal
> > errors) are
> > > > > > > indeed
> > > > > > > > > > > > supported.
> > > > > > > > > > > > > > > E.g. if
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > > > > > > >> >> > > > > >> > aborted by the broker
> > side,
> > > > then
> > > > > > > users
> > > > > > > > > can
> > > > > > > > > > > > still
> > > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > > > > > > >> >> > > > > >> > not throw another
> > exception
> > > > but
> > > > > > just
> > > > > > > > be
> > > > > > > > > > > > treated
> > > > > > > > > > > > > > as a
> > > > > > > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > 1)
> > > > > > > *ConcurrentTransactionsException*:
> > > > > > > > I
> > > > > > > > > > > think
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > > > >> can
> > > > > > > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > > > > > > >> >> > > > > >> > returned but not
> > documented
> > > > yet.
> > > > > > > This
> > > > > > > > > > should
> > > > > > > > > > > > be
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > 2)
> > *InvalidTxnStateException*:
> > > > > > this
> > > > > > > > > error
> > > > > > > > > > is
> > > > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > > > > > > >> >> > > > > >> > state transition failed
> > (e.g.
> > > > it
> > > > > > is
> > > > > > > > > trying
> > > > > > > > > > > to
> > > > > > > > > > > > > > > transit to
> > > > > > > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > > > > > > >> >> > > > > >> > while the current state
> > is not
> > > > > > > > > > > > prepare-commit).
> > > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > > > >> >> could
> > > > > > > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > > > > > > >> >> > > > > >> > a bug on the client
> > internal
> > > > > code
> > > > > > or
> > > > > > > > the
> > > > > > > > > > > > broker
> > > > > > > > > > > > > > > code,
> > > > > > > > > > > > > > > > > OR a
> > > > > > > > > > > > > > > > > >> >> user
> > > > > > > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > > > > > > i.e.
> > > > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > > > > > > >> >> > > > > >> > these exceptions would
> > only be
> > > > > > > > returned
> > > > > > > > > if
> > > > > > > > > > > > users
> > > > > > > > > > > > > > > try to
> > > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn
> > right
> > > > > after
> > > > > > a
> > > > > > > > > > > commitTxn,
> > > > > > > > > > > > > > etc.
> > > > > > > > > > > > > > > So
> > > > > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal
> > error
> > > > > > instead
> > > > > > > > of a
> > > > > > > > > > > fatal
> > > > > > > > > > > > > > > error,
> > > > > > > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*:
> case
> > i
> > > > > > > "indicates
> > > > > > > > > > fatal
> > > > > > > > > > > > > > > > > transactional
> > > > > > > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit
> > > > > > conflicting
> > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > > > > > > >> >> > > > > >> > is treated as
> non-fatal. I
> > > > guess
> > > > > > > your
> > > > > > > > > > > proposal
> > > > > > > > > > > > > is
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > OutOfOrderSequenceException
> > > > > would
> > > > > > be
> > > > > > > > > > treated
> > > > > > > > > > > > > > either
> > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > >> fatal
> > > > > > > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > > > > > > >> >> > > > > >> > transactional producer,
> or
> > > > > > non-fatal
> > > > > > > > > with
> > > > > > > > > > > > > > idempotent
> > > > > > > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > > > >> >> > > > > >> > right? If the producer
> is
> > only
> > > > > > > > > configured
> > > > > > > > > > > with
> > > > > > > > > > > > > > > > > idempotency
> > > > > > > > > > > > > > > > > >> >> but
> > > > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > > > >> >> > > > > >> > transaction, then
> > throwing a
> > > > > > > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would
> be
> > > > > > confusing
> > > > > > > > > since
> > > > > > > > > > > > users
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > > >> using
> > > > > > > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we
> > > > always
> > > > > > > throw
> > > > > > > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter
> > how the
> > > > > > > > producer
> > > > > > > > > is
> > > > > > > > > > > > > > > configured,
> > > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > >> >> let
> > > > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > > > > > > >> >> > > > > >> > decide how to handle it
> > based
> > > > on
> > > > > > > > whether
> > > > > > > > > > it
> > > > > > > > > > > is
> > > > > > > > > > > > > > only
> > > > > > > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn
> > APIs,
> > > > the
> > > > > > > > > `send()`
> > > > > > > > > > > > > > callback /
> > > > > > > > > > > > > > > > > >> future
> > > > > > > > > > > > > > > > > >> >> can
> > > > > > > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > > > > > > >> >> > > > > >> > txn-related exceptions,
> I
> > > > think
> > > > > > this
> > > > > > > > KIP
> > > > > > > > > > > > should
> > > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > > cover
> > > > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > 5) This is related to
> 1/2)
> > > > > above:
> > > > > > > > > > sometimes
> > > > > > > > > > > > > those
> > > > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or
> > > > InvalidTxnState
> > > > > > are
> > > > > > > > not
> > > > > > > > > > due
> > > > > > > > > > > > to
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > state
> > > > > > > > > > > > > > > > > >> >> being
> > > > > > > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > > > > > > >> >> > > > > >> > at the broker side, but
> > maybe
> > > > > > users
> > > > > > > > are
> > > > > > > > > > > doing
> > > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > > > > > > >> >> > > > > >> > wondering if we should
> > further
> > > > > > > > > distinguish
> > > > > > > > > > > > those
> > > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > > > > > > >> >> > > > > >> > a) those that are caused
> > by
> > > > > Kafka
> > > > > > > > > itself,
> > > > > > > > > > > > e.g. a
> > > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > > >> timed
> > > > > > > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > > > > > > >> >> > > > > >> > aborted a txn and later
> an
> > > > > endTxn
> > > > > > > > > request
> > > > > > > > > > is
> > > > > > > > > > > > > > > received,
> > > > > > > > > > > > > > > > > >> and b)
> > > > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > > > > > > >> >> > > > > >> > API call pattern is
> > incorrect,
> > > > > > > causing
> > > > > > > > > the
> > > > > > > > > > > > > request
> > > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > > > > > > >> >> > > > > >> > error code from the
> > broker.
> > > > > > > > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > > > > > > >> >> > > > > >> > me more like for case
> a),
> > but
> > > > > not
> > > > > > > case
> > > > > > > > > b).
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at
> > 4:50 PM
> > > > > > > Boyang
> > > > > > > > > > Chen <
> > > > > > > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I
> think
> > your
> > > > > > > > proposal
> > > > > > > > > > > makes
> > > > > > > > > > > > > > sense
> > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > > >> well,
> > > > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> we could have a
> > universally
> > > > > > agreed
> > > > > > > > > > > exception
> > > > > > > > > > > > > type
> > > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> without having an extra
> > layer
> > > > > on
> > > > > > > top
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> some issue on
> downsides:
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> 1. The exception
> > hierarchy
> > > > will
> > > > > > be
> > > > > > > > more
> > > > > > > > > > > > > complex.
> > > > > > > > > > > > > > > Right
> > > > > > > > > > > > > > > > > >> now
> > > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> RetriableException
> type,
> > if
> > > > we
> > > > > > are
> > > > > > > > > going
> > > > > > > > > > to
> > > > > > > > > > > > > add a
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > `ProducerRetriableException`
> > > > > > type,
> > > > > > > we
> > > > > > > > > > have
> > > > > > > > > > > to
> > > > > > > > > > > > > put
> > > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > >> new
> > > > > > > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> parent of the
> > > > > RetriableException,
> > > > > > > > > because
> > > > > > > > > > > not
> > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> exceptions are
> > `retriable` in
> > > > > > > general
> > > > > > > > > for
> > > > > > > > > > > > > > > producer, for
> > > > > > > > > > > > > > > > > >> >> example
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> We could have an empty
> > > > > interface
> > > > > > > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions
> > > > implement
> > > > > > for
> > > > > > > > > sure,
> > > > > > > > > > > but
> > > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > bit
> > > > > > > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> way we deal with
> > exceptions
> > > > in
> > > > > > > > general.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases
> where
> > we
> > > > > > throw a
> > > > > > > > > > > > > > KafkaException
> > > > > > > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> KafkaException as
> either
> > > > fatal
> > > > > or
> > > > > > > > > > > non-fatal.
> > > > > > > > > > > > If
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also
> > required
> > > > > to
> > > > > > > > > > implement
> > > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > > >> bloated
> > > > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> which could replace
> > > > > > KafkaException
> > > > > > > > > type,
> > > > > > > > > > as
> > > > > > > > > > > > we
> > > > > > > > > > > > > > > couldn't
> > > > > > > > > > > > > > > > > >> mark
> > > > > > > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the
> > > > > encapsulation,
> > > > > > > > > > wrapping
> > > > > > > > > > > > > means
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> of affection to the
> > producer
> > > > > > only,
> > > > > > > > > which
> > > > > > > > > > is
> > > > > > > > > > > > > > > important
> > > > > > > > > > > > > > > > > >> since
> > > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> shared exception types
> to
> > > > > > > implement a
> > > > > > > > > > > > > > > producer-related
> > > > > > > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> as
> > > > > > > UnknownTopicOrPartitionException.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at
> > 3:38
> > > > PM
> > > > > > > > Matthias
> > > > > > > > > > J.
> > > > > > > > > > > > Sax
> > > > > > > > > > > > > <
> > > > > > > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP
> > Boyang!
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing
> > > > > exceptions
> > > > > > > > > makes a
> > > > > > > > > > > lot
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > sense.
> > > > > > > > > > > > > > > > > >> As I
> > > > > > > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> the producer internals
> > well
> > > > > > > enough,
> > > > > > > > I
> > > > > > > > > > > cannot
> > > > > > > > > > > > > > > comment
> > > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> categorization in
> detail
> > > > > though.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering
> is,
> > if
> > > > we
> > > > > > > should
> > > > > > > > > > > > introduce
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > > >> exception
> > > > > > > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal
> exception
> > > > would
> > > > > > > > > implement
> > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> that will wrap
> non-fatal
> > > > > > > exceptions?
> > > > > > > > > > What
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > be the
> > > > > > > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM,
> > Boyang
> > > > > Chen
> > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> I would like to
> start a
> > > > > > > discussion
> > > > > > > > > > thread
> > > > > > > > > > > > for
> > > > > > > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to
> > > > simplify
> > > > > > the
> > > > > > > > > > > exception
> > > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > > > >> logic
> > > > > > > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> transactional
> Producer
> > > > users
> > > > > by
> > > > > > > > > > > classifying
> > > > > > > > > > > > > > > fatal and
> > > > > > > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> and throw them
> > > > > correspondingly
> > > > > > > for
> > > > > > > > > > easier
> > > > > > > > > > > > > catch
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > >> retry.
> > > > > > > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >> > --
> > > > > > > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Hey Jason and Chia-Ping,

looking at the discussion thread for KIP-706, we are still in discussion
with the final API version, so I made the decision to decouple the two KIPs
so that this one could move forward. To be specific, we are still
introducing ProduceFailed error but as a `near-future` improvement, which
should be implemented right after KIP-706 is finalized.

Let me know your thoughts.

Best,
Boyang

On Sat, Jan 30, 2021 at 11:57 AM Chia-Ping Tsai <ch...@apache.org> wrote:

> > a richer return type. Let me know if you are good with this, and whether
> > Chia-Ping is also happy here :)
>
> sure. I'd like to join this bigger party :)
>
> On 2021/01/30 01:03:57, Boyang Chen <re...@gmail.com> wrote:
> > This is a great proposal Jason, I already integrated KIP-691 with KIP-706
> > template to provide a new ProduceFailedException as well as a new
> > Producer#produce API with CompletionStage as old send API replacement
> with
> > a richer return type. Let me know if you are good with this, and whether
> > Chia-Ping is also happy here :)
> >
> > Best,
> > Boyang
> >
> > On Fri, Jan 29, 2021 at 4:23 PM Jason Gustafson <ja...@confluent.io>
> wrote:
> >
> > > Hi Boyang,
> > >
> > > Ok, if we are going to ask all users to update their code anyway,
> maybe we
> > > can get more bang out of this.
> > >
> > > One of the remaining problems is that a user will only get the richer
> > > information if they are supplying a Callback. If they wait on the
> future,
> > > then they will get the usual exception. Our hands are somewhat tied
> here
> > > because of compatibility. One thing we have wanted to do for a long
> time is
> > > introduce a new send method which returns
> > > `CompletableFuture<RecordMetadata>` instead. If we did so, then we
> wouldn't
> > > need to change `Callback` at all. Instead we could make the contract
> of the
> > > new api that it will always return `SendFailedException`, which could
> > > provide a method to get the failure type.
> > >
> > > I think Chia-Ping has been thinking about this:
> > > https://issues.apache.org/jira/browse/KAFKA-12227. Is there a way to
> > > consolidate our efforts in order to avoid unnecessary noise for users?
> It
> > > would be awesome to see a more modern API which also addresses the
> > > confusing error handling.
> > >
> > > -Jason
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jan 29, 2021 at 3:48 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Thanks Jason, my hope is that we could streamline the callback
> function
> > > > signature, so that in the next major release (3.0),
> > > > onCompletion(RecordMetadata metadata, SendFailure sendFailure) will
> > > become
> > > > the only API to be implemented.
> > > >
> > > > On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi Boyang,
> > > > >
> > > > > Do you think it's necessary to deprecate the old `onCompletion`
> > > > callback? I
> > > > > was thinking there's probably no harm leaving it around. Users
> might
> > > not
> > > > > care about the failure type. Other than that, it looks good to me.
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hey Guozhang,
> > > > > >
> > > > > > the CommitFailedException wrapping would cover all the non-fatal
> > > > > exceptions
> > > > > > as we listed out in the KIP, generally speaking any exception
> that
> > > > could
> > > > > > recover safely by calling abortTxn should be wrapped.
> > > > > >
> > > > > > Best,
> > > > > > Boyang
> > > > > >
> > > > > > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hey Boyang,
> > > > > > >
> > > > > > > I think maybe there's something cracking here :) I'm just
> asking
> > > for
> > > > > > > clarifications that as of today, which non-fatal exceptions the
> > > newly
> > > > > > > introduced CommitFailedException would cover, and it seems to
> be
> > > only
> > > > > 1)
> > > > > > > unknown pid, 2) invalid pid mapping, and 3) concurrent
> > > transactions.
> > > > Is
> > > > > > > that correct?
> > > > > > >
> > > > > > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Guozhang, I think TimeoutException would not be covered
> here
> > > as
> > > > > it
> > > > > > > > potentially has a risk of hitting an illegal state on the
> broker
> > > > side
> > > > > > > when
> > > > > > > > the previous commit was actually successful. Users should
> try to
> > > > > > increase
> > > > > > > > their max.block.ms to avoid hitting the timeout as a base
> > > > > suggestion,
> > > > > > > > which
> > > > > > > > is discussed in the KIP.
> > > > > > > >
> > > > > > > > As for the umbrella exception, I agree it has values to some
> > > > extent,
> > > > > > but
> > > > > > > in
> > > > > > > > terms of adoption difficulty, additional exception types
> usually
> > > > make
> > > > > > the
> > > > > > > > error handling more complex than simplified, and we are
> doing our
> > > > > best
> > > > > > to
> > > > > > > > avoid compatibility issues caused by wrapping previously
> thrown
> > > raw
> > > > > > > > exceptions that invalidates user error handling. And I'm
> also not
> > > > in
> > > > > > > favor
> > > > > > > > of our exception hierarchy today where all exceptions are
> > > > subclasses
> > > > > of
> > > > > > > > KafkaException, and we already mixed its handling with other
> > > > specific
> > > > > > > > exception types that could be caught as superclass
> > > KafkaException.
> > > > > That
> > > > > > > > could be addressed in a separate proposal if we see
> necessary.
> > > > > > > >
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks Jason for the suggestion, that looks good to me too.
> > > > > > > > >
> > > > > > > > > Regarding the non-fatal exceptions wrapped as
> CommitFailed, I
> > > > would
> > > > > > > like
> > > > > > > > to
> > > > > > > > > clarify if we would cover all the following cases: 1)
> timeout,
> > > 2)
> > > > > > > unknown
> > > > > > > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > > > > > > >
> > > > > > > > > BTW I think it still makes sense to use an umbrella
> exception
> > > in
> > > > > case
> > > > > > > in
> > > > > > > > > the future we add more non-fatal cases even though today we
> > > only
> > > > > > have a
> > > > > > > > > few.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Jason, I agree with the proposed solution here,
> will
> > > > > update
> > > > > > > the
> > > > > > > > > KIP.
> > > > > > > > > >
> > > > > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > > > > > jason@confluent.io
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Boyang,
> > > > > > > > > > >
> > > > > > > > > > > It seems like a reasonable suggestion. I wonder if a
> flag
> > > is
> > > > > > > > sufficient
> > > > > > > > > > > though. The current `Callback` documentation treats
> "fatal"
> > > > > > errors
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > perspective of the individual message that was sent.
> > > > > > > > > > >
> > > > > > > > > > > ```
> > > > > > > > > > >     *                  Non-Retriable exceptions
> (fatal, the
> > > > > > message
> > > > > > > > > will
> > > > > > > > > > > never be sent):
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > However, we also have fatal errors from the
> perspective of
> > > > the
> > > > > > > > > > transaction
> > > > > > > > > > > (e.g. when the producer gets fenced). Perhaps that
> suggests
> > > > we
> > > > > > need
> > > > > > > > > > > something richer than a boolean flag:
> > > > > > > > > > >
> > > > > > > > > > > At a high level, I think the following cases are
> possible:
> > > > > > > > > > >
> > > > > > > > > > > - message rejected (e.g. message too large, invalid
> topic)
> > > > > > > > > > > - delivery failed after retries/delivery timeout (e.g.
> > > > timeout,
> > > > > > crc
> > > > > > > > > > error,
> > > > > > > > > > > not enough replicas)
> > > > > > > > > > > - transaction failed (e.g. producer fenced, invalid
> > > > transaction
> > > > > > > > state)
> > > > > > > > > > >
> > > > > > > > > > > Perhaps instead we can have a type like the following:
> > > > > > > > > > >
> > > > > > > > > > > class SendFailure {
> > > > > > > > > > >   FailureType failureType;
> > > > > > > > > > >   Exception cause;
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > enum FailureType {
> > > > > > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED,
> TRANSACTION_FAILED
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > (Not married to any of these names, just a starting
> point.)
> > > > > > > > > > >
> > > > > > > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > > > > > > >
> > > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > > SendFailure
> > > > > > > > > failure) {
> > > > > > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > This would give streams and other applications enough
> > > > > information
> > > > > > > to
> > > > > > > > > know
> > > > > > > > > > > whether the message can be retried and whether the
> > > > transaction
> > > > > > can
> > > > > > > be
> > > > > > > > > > > aborted.
> > > > > > > > > > >
> > > > > > > > > > > What do you think?
> > > > > > > > > > >
> > > > > > > > > > > -Jason
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks Jason for the thoughts.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > > > > > > jason@confluent.io
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Boyang,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the iterations here. I think this is
> > > something
> > > > > we
> > > > > > > > should
> > > > > > > > > > > have
> > > > > > > > > > > > > done a long time ago. It sounds like there are two
> API
> > > > > > changes
> > > > > > > > > here:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. We are introducing the `CommitFailedException`
> to
> > > wrap
> > > > > > > > abortable
> > > > > > > > > > > > > errors that are raised from `commitTransaction`.
> This
> > > > > sounds
> > > > > > > fine
> > > > > > > > > to
> > > > > > > > > > > me.
> > > > > > > > > > > > As
> > > > > > > > > > > > > far as I know, the only case we might need this is
> when
> > > > we
> > > > > > add
> > > > > > > > > > support
> > > > > > > > > > > to
> > > > > > > > > > > > > let producers recover from coordinator timeouts.
> Are
> > > > there
> > > > > > any
> > > > > > > > > > others?
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think the purpose here is to ensure non-fatal
> > > > exceptions
> > > > > > are
> > > > > > > > > > unified
> > > > > > > > > > > > under the same
> > > > > > > > > > > > exception umbrella, to make the proceeding to abort
> any
> > > > > ongoing
> > > > > > > > > > > transaction
> > > > > > > > > > > > much easier.
> > > > > > > > > > > > I don't think `coordinator timeouts` is the only
> case to
> > > > > > recover
> > > > > > > > > here,
> > > > > > > > > > > > since we have other
> > > > > > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > > > > > >
> > > > > > > > > > > > 2. We are wrapping non-fatal errors raised from
> `send` as
> > > > > > > > > > > `KafkaException`.
> > > > > > > > > > > > > The motivation for this is less clear to me and it
> > > > doesn't
> > > > > > look
> > > > > > > > > like
> > > > > > > > > > > the
> > > > > > > > > > > > > example from the KIP depends on it. My concern
> here is
> > > > > > > > > compatibility.
> > > > > > > > > > > > > Currently we have the following documentation for
> the
> > > > > > > `Callback`
> > > > > > > > > API:
> > > > > > > > > > > > >
> > > > > > > > > > > > > ```
> > > > > > > > > > > > >      *                  Non-Retriable exceptions
> > > (fatal,
> > > > > the
> > > > > > > > > message
> > > > > > > > > > > will
> > > > > > > > > > > > > never be sent):
> > > > > > > > > > > > >      *
> > > > > > > > > > > > >      *                  InvalidTopicException
> > > > > > > > > > > > >      *
> OffsetMetadataTooLargeException
> > > > > > > > > > > > >      *
> RecordBatchTooLargeException
> > > > > > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > > > > > >      *                  UnknownServerException
> > > > > > > > > > > > >      *                  UnknownProducerIdException
> > > > > > > > > > > > >      *
> InvalidProducerEpochException
> > > > > > > > > > > > >      *
> > > > > > > > > > > > >      *                  Retriable exceptions
> > > (transient,
> > > > > may
> > > > > > be
> > > > > > > > > > covered
> > > > > > > > > > > > by
> > > > > > > > > > > > > increasing #.retries):
> > > > > > > > > > > > >      *
> > > > > > > > > > > > >      *                  CorruptRecordException
> > > > > > > > > > > > >      *                  InvalidMetadataException
> > > > > > > > > > > > >      *
> > > > > NotEnoughReplicasAfterAppendException
> > > > > > > > > > > > >      *                  NotEnoughReplicasException
> > > > > > > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > > > > > > >      *                  TimeoutException
> > > > > > > > > > > > >      *
> > > UnknownTopicOrPartitionException
> > > > > > > > > > > > > ```
> > > > > > > > > > > > >
> > > > > > > > > > > > > If we wrap all the retriable exceptions documented
> here
> > > > as
> > > > > > > > > > > > > `KafkaException`, wouldn't that break any error
> > > handling
> > > > > that
> > > > > > > > users
> > > > > > > > > > > might
> > > > > > > > > > > > > already have? it's gonna introduce a compatibility
> > > issue.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The original intention was to simplify `send`
> callback
> > > > > error
> > > > > > > > > handling
> > > > > > > > > > > by
> > > > > > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > > > > > we have to prepare an exhausting list of exceptions
> to
> > > > catch
> > > > > as
> > > > > > > > > fatal,
> > > > > > > > > > > and
> > > > > > > > > > > > the same lengthy list to catch as
> > > > > > > > > > > > non-fatal. It would be much easier if we got `hints`
> from
> > > > the
> > > > > > > > > callback.
> > > > > > > > > > > > However,
> > > > > > > > > > > > I agree there is a compatibility concern, what about
> > > > > > deprecating
> > > > > > > > the
> > > > > > > > > > > > existing:
> > > > > > > > > > > >
> > > > > > > > > > > > void onCompletion(RecordMetadata metadata, Exception
> > > > > exception)
> > > > > > > > > > > >
> > > > > > > > > > > > and replace it with:
> > > > > > > > > > > >
> > > > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > > Exception
> > > > > > > > > exception,
> > > > > > > > > > > > boolean isFatal) {
> > > > > > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > > to make sure new users get the benefit of
> understanding
> > > the
> > > > > > > > fatality
> > > > > > > > > > > based
> > > > > > > > > > > > on the info presented by the producer?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Jason
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > > > > > info@hiringuru.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Why  we are receiving all emails kindly remove us
> > > from
> > > > > > > > > > > > > > dev@kafka.apache.org we don't want to receive
> emails
> > > > > > > anymore.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> > > > > wangguoz@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks Boyang, yes I think I was confused
> about the
> > > > > > > different
> > > > > > > > > > > > handling
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > two abortTxn calls, and now I get it was not
> > > > > > intentional. I
> > > > > > > > > > think I
> > > > > > > > > > > > do
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > have more concerns.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the clarification Guozhang, I got
> your
> > > > > point
> > > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > > want
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > have a consistent handling of fatal
> exceptions
> > > > being
> > > > > > > thrown
> > > > > > > > > > from
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > abortTxn. I modified the current template to
> move
> > > > the
> > > > > > > fatal
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > try-catch outside of the processing loop to
> make
> > > > sure
> > > > > > we
> > > > > > > > > could
> > > > > > > > > > > get
> > > > > > > > > > > > a
> > > > > > > > > > > > > > chance
> > > > > > > > > > > > > > > > to close consumer/producer modules. Let me
> know
> > > > what
> > > > > > you
> > > > > > > > > think.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen
> <
> > > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > My understanding is that abortTransaction
> would
> > > > > only
> > > > > > > > throw
> > > > > > > > > > when
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > producer is in fatal state. For
> CommitFailed,
> > > the
> > > > > > > > producer
> > > > > > > > > > > should
> > > > > > > > > > > > > > still
> > > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > > in the abortable error state, so that
> > > > > > abortTransaction
> > > > > > > > call
> > > > > > > > > > > would
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > throw.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang
> Wang
> > > <
> > > > > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >> Or are you going to maintain some internal
> > > state
> > > > > > such
> > > > > > > > > that,
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> `abortTransaction` in the catch block
> would
> > > > never
> > > > > > > throw
> > > > > > > > > > again?
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang
> > > Wang <
> > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > I've also thought about this (i.e. using
> > > > > > > CommitFailed
> > > > > > > > > for
> > > > > > > > > > > all
> > > > > > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > > > > > >> > but what I'm pondering is that, in the
> catch
> > > > > > > > > > (CommitFailed)
> > > > > > > > > > > > > block,
> > > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > > >> > would happen if the
> > > > > `producer.abortTransaction();`
> > > > > > > > > throws
> > > > > > > > > > > > again?
> > > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > > >> > that be captured as a fatal and cause
> the
> > > > client
> > > > > > to
> > > > > > > > > close
> > > > > > > > > > > > again.
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > If yes, then naively the pattern would
> be:
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > > > > > >> >         // Transaction commit failed
> with
> > > > > > abortable
> > > > > > > > > error,
> > > > > > > > > > > > user
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > >> reset
> > > > > > > > > > > > > > > > >> >         // the application state and
> resume
> > > > > with a
> > > > > > > new
> > > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > >> > root
> > > > > > > > > > > > > > > > >> >         // cause was wrapped in the
> thrown
> > > > > > > exception.
> > > > > > > > > > > > > > > > >> >
> > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > > > > > >> >         }
> > > > > > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > > > > > >> >     }
> > > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang
> > > Chen <
> > > > > > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> Jason and I were discussing the new API
> > > > offline
> > > > > > and
> > > > > > > > > > decided
> > > > > > > > > > > > to
> > > > > > > > > > > > > > take
> > > > > > > > > > > > > > > > >> >> another
> > > > > > > > > > > > > > > > >> >> approach. Firstly, the reason not to
> > > invent a
> > > > > new
> > > > > > > API
> > > > > > > > > > with
> > > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > > > > > >> >> flag is for compatibility
> consideration,
> > > > since
> > > > > > old
> > > > > > > > EOS
> > > > > > > > > > code
> > > > > > > > > > > > > > would not
> > > > > > > > > > > > > > > > >> know
> > > > > > > > > > > > > > > > >> >> that a given transaction commit was
> failed
> > > > > > > internally
> > > > > > > > > as
> > > > > > > > > > > they
> > > > > > > > > > > > > > don't
> > > > > > > > > > > > > > > > >> listen
> > > > > > > > > > > > > > > > >> >> to the output flag. Our proposed
> > > alternative
> > > > > > > solution
> > > > > > > > > is
> > > > > > > > > > to
> > > > > > > > > > > > let
> > > > > > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > > > > > >> >> throw CommitFailedException whenever
> the
> > > > commit
> > > > > > > > failed
> > > > > > > > > > with
> > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > >> >> exception*, so that on the caller side
> the
> > > > > > handling
> > > > > > > > > logic
> > > > > > > > > > > > > > becomes:
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > > > > > >> >>
>  producer.commitTransaction();
> > > > > > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > > > > > >> >>
> > > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > >> >>
>  producer.abortTransaction();
> > > > > > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > > > > > > > >> >>         // Transaction commit failed
> with
> > > > > > abortable
> > > > > > > > > > error,
> > > > > > > > > > > > user
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > > > > > >> >>         // the application state and
> resume
> > > > > with
> > > > > > a
> > > > > > > > new
> > > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > >> >> root
> > > > > > > > > > > > > > > > >> >>         // cause was wrapped in the
> thrown
> > > > > > > exception.
> > > > > > > > > > > > > > > > >> >>
> > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> This approach looks cleaner as all
> > > exception
> > > > > > types
> > > > > > > > > other
> > > > > > > > > > > than
> > > > > > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > > > > > >> >> will doom to be fatal, which is very
> easy
> > > to
> > > > > > adopt
> > > > > > > > for
> > > > > > > > > > > users.
> > > > > > > > > > > > > In
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> >> meantime, we still maintain the
> commitTxn
> > > > > > behavior
> > > > > > > to
> > > > > > > > > > throw
> > > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> In addition, we decided to drop the
> > > > > > recommendation
> > > > > > > to
> > > > > > > > > > > handle
> > > > > > > > > > > > > > > > >> >> TimeoutException and leave it to the
> users
> > > to
> > > > > > make
> > > > > > > > the
> > > > > > > > > > > call.
> > > > > > > > > > > > > The
> > > > > > > > > > > > > > > > >> downside
> > > > > > > > > > > > > > > > >> >> for blindly calling abortTxn upon
> timeout
> > > is
> > > > > that
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > > > > result in
> > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > >> >> illegal state when the commit was
> already
> > > > > > > successful
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > >> >> side. Without a good guarantee on the
> > > > outcome,
> > > > > > > > > > > > overcomplicating
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> Let me know your thoughts on the new
> > > approach
> > > > > > here,
> > > > > > > > > thank
> > > > > > > > > > > > you!
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM
> Guozhang
> > > > Wang
> > > > > <
> > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > Thanks for your clarification on
> 2)/3),
> > > > that
> > > > > > > makes
> > > > > > > > > > sense.
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM
> Boyang
> > > > Chen
> > > > > <
> > > > > > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >> > > Thanks for the input Guozhang,
> replied
> > > > > > inline.
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM
> > > Guozhang
> > > > > > Wang <
> > > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I
> read it
> > > > > again
> > > > > > > and
> > > > > > > > > > have
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> following
> > > > > > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if
> > > > commitTxn
> > > > > > does
> > > > > > > > not
> > > > > > > > > > > throw
> > > > > > > > > > > > > any
> > > > > > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > > > > > >> >> > > > exception, and instead we rely
> on the
> > > > > > > > subsequent
> > > > > > > > > > > > beginTxn
> > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > > > > > >> >> > > > may violate some callers with a
> > > pattern
> > > > > > that
> > > > > > > > > > relying
> > > > > > > > > > > on
> > > > > > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > > > > > >> >> > > > succeed to make some non-rollback
> > > > > > operations.
> > > > > > > > For
> > > > > > > > > > > > > example:
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > > > > > >> >> > > > // do some read-write on my
> local DB
> > > > > > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then
> commit
> > > > the
> > > > > > DB
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > The issue is that, committing DB
> is a
> > > > > > > > > non-rollback
> > > > > > > > > > > > > > operation,
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > >> >> users
> > > > > > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > > > > > >> >> > > > just rely on commitTxn to return
> > > > without
> > > > > > > error
> > > > > > > > to
> > > > > > > > > > > make
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > > > > > >> >> > > > call. Of course we can just claim
> > > this
> > > > > > > pattern
> > > > > > > > is
> > > > > > > > > > not
> > > > > > > > > > > > > > > > legitimate
> > > > > > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > > > > > >> >> > > > the right way of doing things,
> but
> > > many
> > > > > > users
> > > > > > > > may
> > > > > > > > > > > > > naturally
> > > > > > > > > > > > > > > > adopt
> > > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > So maybe we should still let
> > > commitTxn
> > > > > also
> > > > > > > > throw
> > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > > > > > >> >> > > > which case we would then call
> > > abortTxn
> > > > > > again.
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > But if we do this, the pattern
> > > becomes:
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > > > > > >> >> > > >     } catch (...) {        //
> > > enumerate
> > > > > all
> > > > > > > > fatal
> > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > Thank makes me think, the
> alternative
> > > > we
> > > > > > have
> > > > > > > > > > > discussed
> > > > > > > > > > > > > > offline
> > > > > > > > > > > > > > > > >> may
> > > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > > >> >> > > > better: let commitTxn() to
> return a
> > > > > boolean
> > > > > > > > flag.
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > * If it returns true, it means
> the
> > > > commit
> > > > > > > > > > succeeded.
> > > > > > > > > > > > > Users
> > > > > > > > > > > > > > can
> > > > > > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > > > > > >> >> > > > continue and do any external
> > > > non-rollback
> > > > > > > > > > operations
> > > > > > > > > > > if
> > > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > >> like.
> > > > > > > > > > > > > > > > >> >> > > > * If it returns false, it means
> the
> > > > > commit
> > > > > > > > failed
> > > > > > > > > > > with
> > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > >> >> error
> > > > > > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > > > > > >> >> > > > the txn has been aborted*. Users
> do
> > > not
> > > > > > need
> > > > > > > to
> > > > > > > > > > call
> > > > > > > > > > > > > > abortTxn
> > > > > > > > > > > > > > > > >> again.
> > > > > > > > > > > > > > > > >> >> > > > * If it throws, then it means
> fatal
> > > > > errors.
> > > > > > > > Users
> > > > > > > > > > > > should
> > > > > > > > > > > > > > shut
> > > > > > > > > > > > > > > > >> down
> > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > In this case, the pattern
> becomes:
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > > > > > >> >> > > >         commitSucceeded =
> commitTxn()
> > > > > > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > > > > > >> >> > > >         // reset offsets,
> rollback
> > > > > > > operations,
> > > > > > > > > etc
> > > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > > > > > >> >> > > >    // do other non-rollbackable
> > > things
> > > > > > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > > > > > >> >> > > >    // reset offsets, rollback
> > > > operations,
> > > > > > etc
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > Of course, if we want to have
> better
> > > > > > > visibility
> > > > > > > > > > into
> > > > > > > > > > > > what
> > > > > > > > > > > > > > > > caused
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > > > > > >> >> > > > to fail and txn to abort. We can
> let
> > > > the
> > > > > > > return
> > > > > > > > > > type
> > > > > > > > > > > be
> > > > > > > > > > > > > an
> > > > > > > > > > > > > > enum
> > > > > > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > > >> >> > > > a flag. But my main idea is to
> still
> > > > let
> > > > > > the
> > > > > > > > > > > commitTxn
> > > > > > > > > > > > be
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> final
> > > > > > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > > > > > >> >> > > > users can tell whether this txn
> > > > succeeded
> > > > > > or
> > > > > > > > not,
> > > > > > > > > > > > instead
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > I agree that adding a boolean
> flag is
> > > > > > indeed
> > > > > > > > > useful
> > > > > > > > > > > in
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > case.
> > > > > > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the
> behavior
> > > to
> > > > > > throw
> > > > > > > > > fatal
> > > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > > >> >> > > > sure what do you mean by "throw"
> > > here.
> > > > > Are
> > > > > > > you
> > > > > > > > > > > > proposing
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal
> > > > > exceptions
> > > > > > > when
> > > > > > > > > > > > triggered
> > > > > > > > > > > > > > > > without
> > > > > > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the
> > > benefit
> > > > is
> > > > > > to
> > > > > > > > make
> > > > > > > > > > the
> > > > > > > > > > > > end
> > > > > > > > > > > > > > > > user's
> > > > > > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the
> > > claim
> > > > > > > > regarding
> > > > > > > > > > the
> > > > > > > > > > > > > > callback
> > > > > > > > > > > > > > > > >> that
> > > > > > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > > > > > >> >> > > > setup, it is not required to
> handle
> > > the
> > > > > > > > > exception".
> > > > > > > > > > > Are
> > > > > > > > > > > > > you
> > > > > > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to
> > > > handle
> > > > > > any
> > > > > > > > > > > exceptions
> > > > > > > > > > > > > if
> > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > > > > > >> >> > > > in the callback anymore, but
> just let
> > > > > > > > commitTxn()
> > > > > > > > > > > > itself
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > > >> >> > > > notified about the problem?
> > > Personally
> > > > I
> > > > > > > think
> > > > > > > > in
> > > > > > > > > > > > Streams
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > >> should
> > > > > > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > > > > > >> >> > > > make the handling logic of the
> > > callback
> > > > > to
> > > > > > be
> > > > > > > > > > > > consistent
> > > > > > > > > > > > > > > > >> regardless
> > > > > > > > > > > > > > > > >> >> of
> > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > >> >> > > > EOS settings (today we have
> different
> > > > > logic
> > > > > > > > > > depending
> > > > > > > > > > > > on
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > >> logic,
> > > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > > >> >> > > > I think could be unified as
> well).
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > My idea originates from the
> claim on
> > > > send
> > > > > > > API:
> > > > > > > > > > > > > > > > >> >> > > "When used as part of a
> transaction, it
> > > > is
> > > > > > not
> > > > > > > > > > > necessary
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > define
> > > > > > > > > > > > > > > > >> a
> > > > > > > > > > > > > > > > >> >> > > callback or check the result of the
> > > > future
> > > > > > in
> > > > > > > > > order
> > > > > > > > > > to
> > > > > > > > > > > > > > detect
> > > > > > > > > > > > > > > > >> errors
> > > > > > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > > > > > >> >> > > My understanding is that for EOS,
> the
> > > > > > exception
> > > > > > > > > will
> > > > > > > > > > be
> > > > > > > > > > > > > > detected
> > > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > > > > > >> >> > > transactional APIs either way, so
> it
> > > is a
> > > > > > > > duplicate
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > >> track
> > > > > > > > > > > > > > > > >> >> > > all the sendExceptions in
> > > > RecordCollector.
> > > > > > > > > However, I
> > > > > > > > > > > > > looked
> > > > > > > > > > > > > > up
> > > > > > > > > > > > > > > > >> >> > > sendException is being used today
> as
> > > > > follow:
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > 1. Pass to
> "ProductionExceptionHandler"
> > > > for
> > > > > > any
> > > > > > > > > > default
> > > > > > > > > > > > or
> > > > > > > > > > > > > > > > >> customized
> > > > > > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > > > > > >> >> > > 2. Stop collecting offset info or
> new
> > > > > > > exceptions
> > > > > > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in
> > > > close(),
> > > > > > > > flush()
> > > > > > > > > > or
> > > > > > > > > > > > new
> > > > > > > > > > > > > > send()
> > > > > > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > To my understanding, we should
> still
> > > > honor
> > > > > > the
> > > > > > > > > > > commitment
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > #1
> > > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > > >> >> any
> > > > > > > > > > > > > > > > >> >> > > user customized implementation.
> The #2
> > > > does
> > > > > > not
> > > > > > > > > > really
> > > > > > > > > > > > > > affect our
> > > > > > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still
> valuable
> > > > as
> > > > > it
> > > > > > > > could
> > > > > > > > > > > help
> > > > > > > > > > > > us
> > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > > >> fast
> > > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > > >> >> > > new send() instead of waiting to
> later
> > > > > stage
> > > > > > of
> > > > > > > > > > > > processing.
> > > > > > > > > > > > > > In
> > > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > > > > > >> >> > > I agree we should continue to
> record
> > > send
> > > > > > > > > exceptions
> > > > > > > > > > > even
> > > > > > > > > > > > > > under
> > > > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > > >> >> case
> > > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > > >> >> > > ensure the strength of stream side
> > > > Producer
> > > > > > > > logic.
> > > > > > > > > On
> > > > > > > > > > > the
> > > > > > > > > > > > > > safer
> > > > > > > > > > > > > > > > >> side,
> > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > > > > > >> >> > > longer need to wrap certain fatal
> > > > > exceptions
> > > > > > > like
> > > > > > > > > > > > > > ProducerFenced
> > > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > >> >> > > TaskMigrated, since they should not
> > > crash
> > > > > the
> > > > > > > > > stream
> > > > > > > > > > > > thread
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > >> thrown
> > > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > > >> >> > > raw format, once we adopt the new
> > > > > processing
> > > > > > > > model
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > > > send
> > > > > > > > > > > > > > > > >> phase.
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM
> > > Boyang
> > > > > > Chen <
> > > > > > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback
> so
> > > > far.
> > > > > I
> > > > > > > have
> > > > > > > > > > > > polished
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > KIP
> > > > > > > > > > > > > > > > >> >> after
> > > > > > > > > > > > > > > > >> >> > > > > offline discussion with some
> folks
> > > > > > working
> > > > > > > on
> > > > > > > > > EOS
> > > > > > > > > > > to
> > > > > > > > > > > > > > make the
> > > > > > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > > > > > >> >> > > > > handling more lightweight. The
> > > > > essential
> > > > > > > > change
> > > > > > > > > > is
> > > > > > > > > > > > that
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > >> not
> > > > > > > > > > > > > > > > >> >> > > > > inventing a new intermediate
> > > > exception
> > > > > > > type,
> > > > > > > > > but
> > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > > > This way for any exception
> thrown
> > > > from
> > > > > > > phase
> > > > > > > > 1,
> > > > > > > > > > > will
> > > > > > > > > > > > be
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > > > > > >> >> > > > > phase 2 whether we should do
> commit
> > > > or
> > > > > > > abort,
> > > > > > > > > and
> > > > > > > > > > > > from
> > > > > > > > > > > > > > now on
> > > > > > > > > > > > > > > > >> >> > > > > `commitTransaction` should only
> > > throw
> > > > > > fatal
> > > > > > > > > > > > exceptions,
> > > > > > > > > > > > > > > > >> similar to
> > > > > > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > > > > > > > KafkaException
> > > > > > > > > > > caught
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > > > > > >> >> > > > > will be definitely fatal to
> crash
> > > the
> > > > > > app.
> > > > > > > > For
> > > > > > > > > > more
> > > > > > > > > > > > > > advanced
> > > > > > > > > > > > > > > > >> users
> > > > > > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > >> >> > > > > Streams, we have the ability to
> > > > further
> > > > > > > wrap
> > > > > > > > > > > selected
> > > > > > > > > > > > > > types
> > > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > > >> >> > > > > exceptions to trigger task
> > > migration
> > > > if
> > > > > > > > > > necessary.
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > > > More details in the KIP, feel
> free
> > > to
> > > > > > take
> > > > > > > > > > another
> > > > > > > > > > > > > look,
> > > > > > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM
> > > > Boyang
> > > > > > > Chen <
> > > > > > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the
> feedback.
> > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26
> AM
> > > > Bruno
> > > > > > > > Cadonna
> > > > > > > > > <
> > > > > > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not
> > > know
> > > > > the
> > > > > > > > > producer
> > > > > > > > > > > > > > internal
> > > > > > > > > > > > > > > > well
> > > > > > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > > > > > >> >> > > > > >> comment on the
> categorization.
> > > > > > However,
> > > > > > > I
> > > > > > > > > > think
> > > > > > > > > > > > > > having a
> > > > > > > > > > > > > > > > >> super
> > > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException)
> that
> > > > > > encodes
> > > > > > > if
> > > > > > > > > an
> > > > > > > > > > > > > > exception is
> > > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > > > > > >> >> > > > > >> is cleaner, better
> > > understandable
> > > > > and
> > > > > > > less
> > > > > > > > > > > > > > error-prone,
> > > > > > > > > > > > > > > > >> because
> > > > > > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal
> > > > > exception
> > > > > > > in
> > > > > > > > > > future
> > > > > > > > > > > > you
> > > > > > > > > > > > > > just
> > > > > > > > > > > > > > > > >> need
> > > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > > >> >> > > > > >> about letting it inherit
> from
> > > the
> > > > > > super
> > > > > > > > > > > exception
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > > >> >> > > > > >> the code will just behave
> > > > correctly
> > > > > > > > without
> > > > > > > > > > the
> > > > > > > > > > > > need
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > wrap
> > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > > > > > >> >> > > > > >> exception into another
> exception
> > > > > each
> > > > > > > time
> > > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > > > > > >> >> it
> > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > >> >> > > > > >> thrown at different
> location in
> > > > the
> > > > > > > code).
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> As far as I understand the
> > > > following
> > > > > > > > > statement
> > > > > > > > > > > > from
> > > > > > > > > > > > > > your
> > > > > > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > > > > > >> >> > > > > >> is the reason that currently
> > > such
> > > > a
> > > > > > > super
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > is not
> > > > > > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> "Right now we have
> > > > > RetriableException
> > > > > > > > type,
> > > > > > > > > if
> > > > > > > > > > > we
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException`
> > > type,
> > > > > we
> > > > > > > have
> > > > > > > > > to
> > > > > > > > > > > put
> > > > > > > > > > > > > > this new
> > > > > > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > >> >> > > > > >> the parent of the
> > > > > RetriableException,
> > > > > > > > > because
> > > > > > > > > > > not
> > > > > > > > > > > > > all
> > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > > >> >> > > > > >> exceptions are `retriable`
> in
> > > > > general
> > > > > > > for
> > > > > > > > > > > > producer"
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> In the list of exceptions in
> > > your
> > > > > > KIP, I
> > > > > > > > > found
> > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > > > > > RetriableException.
> > > > > > > I
> > > > > > > > > > guess
> > > > > > > > > > > > > those
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > > > > > >> >> > > > > >> are referring to in your
> > > > statement:
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >>
> InvalidProducerEpochException
> > > > > > > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> All of those exceptions are
> > > > > non-fatal
> > > > > > > and
> > > > > > > > do
> > > > > > > > > > not
> > > > > > > > > > > > > > inherit
> > > > > > > > > > > > > > > > >> from
> > > > > > > > > > > > > > > > >> >> > > > > >> RetriableException. Is
> there a
> > > > > reason
> > > > > > > for
> > > > > > > > > > that?
> > > > > > > > > > > If
> > > > > > > > > > > > > > they
> > > > > > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > > > > > >> >> > > > > >> RetriableException we would
> be a
> > > > bit
> > > > > > > > closer
> > > > > > > > > > to a
> > > > > > > > > > > > > super
> > > > > > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> The reason is that sender
> may
> > > > catch
> > > > > > > those
> > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it
> currently
> > > > > does
> > > > > > > > > infinite
> > > > > > > > > > > > > > > > >> >> > > > > > retries on
> RetriableException. To
> > > > > make
> > > > > > > sure
> > > > > > > > > we
> > > > > > > > > > > > could
> > > > > > > > > > > > > > > > trigger
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> > > > > > abortTransaction() by
> catching
> > > > > > non-fatal
> > > > > > > > > thrown
> > > > > > > > > > > > > > > > >> >> > > > > > exceptions and reinitialize
> the
> > > txn
> > > > > > > state,
> > > > > > > > we
> > > > > > > > > > > chose
> > > > > > > > > > > > > > not to
> > > > > > > > > > > > > > > > >> let
> > > > > > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > > > > > >> >> > > > > > exceptions inherit
> > > > RetriableException
> > > > > > so
> > > > > > > > that
> > > > > > > > > > > > > > > > >> >> > > > > > they won't cause infinite
> retry
> > > on
> > > > > > > sender.
> > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > >> >> > > > > >> With
> OutOfOrderSequenceException
> > > > and
> > > > > > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > > >> >> > > > > >> to understand that their
> > > fatality
> > > > > > > depends
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > > > > type
> > > > > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > > > > >> >> > > > > >> configuration) of the
> producer.
> > > > That
> > > > > > > makes
> > > > > > > > > it
> > > > > > > > > > > > > > difficult to
> > > > > > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > > > > > >> >> > > > > >> exception that encodes the
> > > > > > retriability
> > > > > > > as
> > > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > > above.
> > > > > > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > > > > > >> >> > > > > >> possible to introduce
> exceptions
> > > > > that
> > > > > > > > > inherit
> > > > > > > > > > > from
> > > > > > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > > > > > >> >> > > > > >> and that are thrown when
> those
> > > > > > > exceptions
> > > > > > > > > are
> > > > > > > > > > > > caught
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > > > > > >> >> > > > > >> and the type of the
> producer is
> > > > such
> > > > > > > that
> > > > > > > > > the
> > > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> Yea, I think in general the
> > > > > exception
> > > > > > > type
> > > > > > > > > > > mixing
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > > >> difficult
> > > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > > > > > >> >> > > > > > them all right. I have
> already
> > > > > proposed
> > > > > > > > > another
> > > > > > > > > > > > > > solution
> > > > > > > > > > > > > > > > >> based
> > > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > > > > > >> >> > > > > > offline discussion with some
> > > folks
> > > > > > > working
> > > > > > > > on
> > > > > > > > > > EOS
> > > > > > > > > > > > > > > > >> >> > > > > > to make the handling more
> > > > > > straightforward
> > > > > > > > for
> > > > > > > > > > end
> > > > > > > > > > > > > users
> > > > > > > > > > > > > > > > >> without
> > > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > > > > > >> >> > > > > > to distinguish exception
> > > fatality.
> > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang
> Wang
> > > > > > wrote:
> > > > > > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the
> > > proposal!
> > > > I
> > > > > > > made a
> > > > > > > > > > pass
> > > > > > > > > > > > over
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > >> list
> > > > > > > > > > > > > > > > >> >> and
> > > > > > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not
> part
> > > of
> > > > > the
> > > > > > > > public
> > > > > > > > > > > API,
> > > > > > > > > > > > I
> > > > > > > > > > > > > > think
> > > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > > > > > >> >> > > > > >> > that the suggested pattern
> > > (i.e.
> > > > > > user
> > > > > > > > can
> > > > > > > > > > > always
> > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > > >> >> > > > > >> > handling non-fatal
> errors) are
> > > > > > indeed
> > > > > > > > > > > supported.
> > > > > > > > > > > > > > E.g. if
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > > > > > >> >> > > > > >> > aborted by the broker
> side,
> > > then
> > > > > > users
> > > > > > > > can
> > > > > > > > > > > still
> > > > > > > > > > > > > > call
> > > > > > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > > > > > >> >> > > > > >> > not throw another
> exception
> > > but
> > > > > just
> > > > > > > be
> > > > > > > > > > > treated
> > > > > > > > > > > > > as a
> > > > > > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > 1)
> > > > > > *ConcurrentTransactionsException*:
> > > > > > > I
> > > > > > > > > > think
> > > > > > > > > > > > this
> > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > > >> can
> > > > > > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > > > > > >> >> > > > > >> > returned but not
> documented
> > > yet.
> > > > > > This
> > > > > > > > > should
> > > > > > > > > > > be
> > > > > > > > > > > > a
> > > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > 2)
> *InvalidTxnStateException*:
> > > > > this
> > > > > > > > error
> > > > > > > > > is
> > > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > > > > > >> >> > > > > >> > state transition failed
> (e.g.
> > > it
> > > > > is
> > > > > > > > trying
> > > > > > > > > > to
> > > > > > > > > > > > > > transit to
> > > > > > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > > > > > >> >> > > > > >> > while the current state
> is not
> > > > > > > > > > > prepare-commit).
> > > > > > > > > > > > > This
> > > > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > > >> >> could
> > > > > > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > > > > > >> >> > > > > >> > a bug on the client
> internal
> > > > code
> > > > > or
> > > > > > > the
> > > > > > > > > > > broker
> > > > > > > > > > > > > > code,
> > > > > > > > > > > > > > > > OR a
> > > > > > > > > > > > > > > > >> >> user
> > > > > > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > > > > > i.e.
> > > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > > > > > >> >> > > > > >> > these exceptions would
> only be
> > > > > > > returned
> > > > > > > > if
> > > > > > > > > > > users
> > > > > > > > > > > > > > try to
> > > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn
> right
> > > > after
> > > > > a
> > > > > > > > > > commitTxn,
> > > > > > > > > > > > > etc.
> > > > > > > > > > > > > > So
> > > > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal
> error
> > > > > instead
> > > > > > > of a
> > > > > > > > > > fatal
> > > > > > > > > > > > > > error,
> > > > > > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case
> i
> > > > > > "indicates
> > > > > > > > > fatal
> > > > > > > > > > > > > > > > transactional
> > > > > > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit
> > > > > conflicting
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I
> > > guess
> > > > > > your
> > > > > > > > > > proposal
> > > > > > > > > > > > is
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> OutOfOrderSequenceException
> > > > would
> > > > > be
> > > > > > > > > treated
> > > > > > > > > > > > > either
> > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > >> fatal
> > > > > > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > > > > > >> >> > > > > >> > transactional producer, or
> > > > > non-fatal
> > > > > > > > with
> > > > > > > > > > > > > idempotent
> > > > > > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > > >> >> > > > > >> > right? If the producer is
> only
> > > > > > > > configured
> > > > > > > > > > with
> > > > > > > > > > > > > > > > idempotency
> > > > > > > > > > > > > > > > >> >> but
> > > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > > >> >> > > > > >> > transaction, then
> throwing a
> > > > > > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be
> > > > > confusing
> > > > > > > > since
> > > > > > > > > > > users
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > >> using
> > > > > > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we
> > > always
> > > > > > throw
> > > > > > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter
> how the
> > > > > > > producer
> > > > > > > > is
> > > > > > > > > > > > > > configured,
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > >> >> let
> > > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > > > > > >> >> > > > > >> > decide how to handle it
> based
> > > on
> > > > > > > whether
> > > > > > > > > it
> > > > > > > > > > is
> > > > > > > > > > > > > only
> > > > > > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn
> APIs,
> > > the
> > > > > > > > `send()`
> > > > > > > > > > > > > callback /
> > > > > > > > > > > > > > > > >> future
> > > > > > > > > > > > > > > > >> >> can
> > > > > > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I
> > > think
> > > > > this
> > > > > > > KIP
> > > > > > > > > > > should
> > > > > > > > > > > > > also
> > > > > > > > > > > > > > > > cover
> > > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2)
> > > > above:
> > > > > > > > > sometimes
> > > > > > > > > > > > those
> > > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or
> > > InvalidTxnState
> > > > > are
> > > > > > > not
> > > > > > > > > due
> > > > > > > > > > > to
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > state
> > > > > > > > > > > > > > > > >> >> being
> > > > > > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > > > > > >> >> > > > > >> > at the broker side, but
> maybe
> > > > > users
> > > > > > > are
> > > > > > > > > > doing
> > > > > > > > > > > > > > something
> > > > > > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > > > > > >> >> > > > > >> > wondering if we should
> further
> > > > > > > > distinguish
> > > > > > > > > > > those
> > > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > > > > > >> >> > > > > >> > a) those that are caused
> by
> > > > Kafka
> > > > > > > > itself,
> > > > > > > > > > > e.g. a
> > > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > > >> timed
> > > > > > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an
> > > > endTxn
> > > > > > > > request
> > > > > > > > > is
> > > > > > > > > > > > > > received,
> > > > > > > > > > > > > > > > >> and b)
> > > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > > > > > >> >> > > > > >> > API call pattern is
> incorrect,
> > > > > > causing
> > > > > > > > the
> > > > > > > > > > > > request
> > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > > > > > >> >> > > > > >> > error code from the
> broker.
> > > > > > > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > > > > > >> >> > > > > >> > me more like for case a),
> but
> > > > not
> > > > > > case
> > > > > > > > b).
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at
> 4:50 PM
> > > > > > Boyang
> > > > > > > > > Chen <
> > > > > > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think
> your
> > > > > > > proposal
> > > > > > > > > > makes
> > > > > > > > > > > > > sense
> > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > > >> well,
> > > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > > > > > >> >> > > > > >> >> we could have a
> universally
> > > > > agreed
> > > > > > > > > > exception
> > > > > > > > > > > > type
> > > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > > > > > >> >> > > > > >> >> without having an extra
> layer
> > > > on
> > > > > > top
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > > > actual
> > > > > > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >> 1. The exception
> hierarchy
> > > will
> > > > > be
> > > > > > > more
> > > > > > > > > > > > complex.
> > > > > > > > > > > > > > Right
> > > > > > > > > > > > > > > > >> now
> > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > > > > > >> >> > > > > >> >> RetriableException type,
> if
> > > we
> > > > > are
> > > > > > > > going
> > > > > > > > > to
> > > > > > > > > > > > add a
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> `ProducerRetriableException`
> > > > > type,
> > > > > > we
> > > > > > > > > have
> > > > > > > > > > to
> > > > > > > > > > > > put
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > >> new
> > > > > > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > > > > > >> >> > > > > >> >> parent of the
> > > > RetriableException,
> > > > > > > > because
> > > > > > > > > > not
> > > > > > > > > > > > all
> > > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > > >> >> > > > > >> >> exceptions are
> `retriable` in
> > > > > > general
> > > > > > > > for
> > > > > > > > > > > > > > producer, for
> > > > > > > > > > > > > > > > >> >> example
> > > > > > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > > > > > >> >> > > > > >> >> We could have an empty
> > > > interface
> > > > > > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions
> > > implement
> > > > > for
> > > > > > > > sure,
> > > > > > > > > > but
> > > > > > > > > > > > > it's
> > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > bit
> > > > > > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > > > > > >> >> > > > > >> >> way we deal with
> exceptions
> > > in
> > > > > > > general.
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where
> we
> > > > > throw a
> > > > > > > > > > > > > KafkaException
> > > > > > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > > > > > >> >> > > > > >> >> KafkaException as either
> > > fatal
> > > > or
> > > > > > > > > > non-fatal.
> > > > > > > > > > > If
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > use
> > > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also
> required
> > > > to
> > > > > > > > > implement
> > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > >> bloated
> > > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > > >> >> > > > > >> >> which could replace
> > > > > KafkaException
> > > > > > > > type,
> > > > > > > > > as
> > > > > > > > > > > we
> > > > > > > > > > > > > > couldn't
> > > > > > > > > > > > > > > > >> mark
> > > > > > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the
> > > > encapsulation,
> > > > > > > > > wrapping
> > > > > > > > > > > > means
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > > > > > >> >> > > > > >> >> of affection to the
> producer
> > > > > only,
> > > > > > > > which
> > > > > > > > > is
> > > > > > > > > > > > > > important
> > > > > > > > > > > > > > > > >> since
> > > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > > > > > >> >> > > > > >> >> shared exception types to
> > > > > > implement a
> > > > > > > > > > > > > > producer-related
> > > > > > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > > > > > >> >> > > > > >> >> as
> > > > > > UnknownTopicOrPartitionException.
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at
> 3:38
> > > PM
> > > > > > > Matthias
> > > > > > > > > J.
> > > > > > > > > > > Sax
> > > > > > > > > > > > <
> > > > > > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP
> Boyang!
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing
> > > > exceptions
> > > > > > > > makes a
> > > > > > > > > > lot
> > > > > > > > > > > > of
> > > > > > > > > > > > > > sense.
> > > > > > > > > > > > > > > > >> As I
> > > > > > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> the producer internals
> well
> > > > > > enough,
> > > > > > > I
> > > > > > > > > > cannot
> > > > > > > > > > > > > > comment
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> categorization in detail
> > > > though.
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is,
> if
> > > we
> > > > > > should
> > > > > > > > > > > introduce
> > > > > > > > > > > > > an
> > > > > > > > > > > > > > > > >> exception
> > > > > > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception
> > > would
> > > > > > > > implement
> > > > > > > > > > > > instead
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal
> > > > > > exceptions?
> > > > > > > > > What
> > > > > > > > > > > > would
> > > > > > > > > > > > > > be the
> > > > > > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM,
> Boyang
> > > > Chen
> > > > > > > > wrote:
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a
> > > > > > discussion
> > > > > > > > > thread
> > > > > > > > > > > for
> > > > > > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to
> > > simplify
> > > > > the
> > > > > > > > > > exception
> > > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > > >> logic
> > > > > > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer
> > > users
> > > > by
> > > > > > > > > > classifying
> > > > > > > > > > > > > > fatal and
> > > > > > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> and throw them
> > > > correspondingly
> > > > > > for
> > > > > > > > > easier
> > > > > > > > > > > > catch
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > >> retry.
> > > > > > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >> > --
> > > > > > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Chia-Ping Tsai <ch...@apache.org>.
> a richer return type. Let me know if you are good with this, and whether
> Chia-Ping is also happy here :)

sure. I'd like to join this bigger party :)

On 2021/01/30 01:03:57, Boyang Chen <re...@gmail.com> wrote: 
> This is a great proposal Jason, I already integrated KIP-691 with KIP-706
> template to provide a new ProduceFailedException as well as a new
> Producer#produce API with CompletionStage as old send API replacement with
> a richer return type. Let me know if you are good with this, and whether
> Chia-Ping is also happy here :)
> 
> Best,
> Boyang
> 
> On Fri, Jan 29, 2021 at 4:23 PM Jason Gustafson <ja...@confluent.io> wrote:
> 
> > Hi Boyang,
> >
> > Ok, if we are going to ask all users to update their code anyway, maybe we
> > can get more bang out of this.
> >
> > One of the remaining problems is that a user will only get the richer
> > information if they are supplying a Callback. If they wait on the future,
> > then they will get the usual exception. Our hands are somewhat tied here
> > because of compatibility. One thing we have wanted to do for a long time is
> > introduce a new send method which returns
> > `CompletableFuture<RecordMetadata>` instead. If we did so, then we wouldn't
> > need to change `Callback` at all. Instead we could make the contract of the
> > new api that it will always return `SendFailedException`, which could
> > provide a method to get the failure type.
> >
> > I think Chia-Ping has been thinking about this:
> > https://issues.apache.org/jira/browse/KAFKA-12227. Is there a way to
> > consolidate our efforts in order to avoid unnecessary noise for users? It
> > would be awesome to see a more modern API which also addresses the
> > confusing error handling.
> >
> > -Jason
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Jan 29, 2021 at 3:48 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Thanks Jason, my hope is that we could streamline the callback function
> > > signature, so that in the next major release (3.0),
> > > onCompletion(RecordMetadata metadata, SendFailure sendFailure) will
> > become
> > > the only API to be implemented.
> > >
> > > On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Boyang,
> > > >
> > > > Do you think it's necessary to deprecate the old `onCompletion`
> > > callback? I
> > > > was thinking there's probably no harm leaving it around. Users might
> > not
> > > > care about the failure type. Other than that, it looks good to me.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey Guozhang,
> > > > >
> > > > > the CommitFailedException wrapping would cover all the non-fatal
> > > > exceptions
> > > > > as we listed out in the KIP, generally speaking any exception that
> > > could
> > > > > recover safely by calling abortTxn should be wrapped.
> > > > >
> > > > > Best,
> > > > > Boyang
> > > > >
> > > > > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hey Boyang,
> > > > > >
> > > > > > I think maybe there's something cracking here :) I'm just asking
> > for
> > > > > > clarifications that as of today, which non-fatal exceptions the
> > newly
> > > > > > introduced CommitFailedException would cover, and it seems to be
> > only
> > > > 1)
> > > > > > unknown pid, 2) invalid pid mapping, and 3) concurrent
> > transactions.
> > > Is
> > > > > > that correct?
> > > > > >
> > > > > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Guozhang, I think TimeoutException would not be covered here
> > as
> > > > it
> > > > > > > potentially has a risk of hitting an illegal state on the broker
> > > side
> > > > > > when
> > > > > > > the previous commit was actually successful. Users should try to
> > > > > increase
> > > > > > > their max.block.ms to avoid hitting the timeout as a base
> > > > suggestion,
> > > > > > > which
> > > > > > > is discussed in the KIP.
> > > > > > >
> > > > > > > As for the umbrella exception, I agree it has values to some
> > > extent,
> > > > > but
> > > > > > in
> > > > > > > terms of adoption difficulty, additional exception types usually
> > > make
> > > > > the
> > > > > > > error handling more complex than simplified, and we are doing our
> > > > best
> > > > > to
> > > > > > > avoid compatibility issues caused by wrapping previously thrown
> > raw
> > > > > > > exceptions that invalidates user error handling. And I'm also not
> > > in
> > > > > > favor
> > > > > > > of our exception hierarchy today where all exceptions are
> > > subclasses
> > > > of
> > > > > > > KafkaException, and we already mixed its handling with other
> > > specific
> > > > > > > exception types that could be caught as superclass
> > KafkaException.
> > > > That
> > > > > > > could be addressed in a separate proposal if we see necessary.
> > > > > > >
> > > > > > > Boyang
> > > > > > >
> > > > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <
> > wangguoz@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Jason for the suggestion, that looks good to me too.
> > > > > > > >
> > > > > > > > Regarding the non-fatal exceptions wrapped as CommitFailed, I
> > > would
> > > > > > like
> > > > > > > to
> > > > > > > > clarify if we would cover all the following cases: 1) timeout,
> > 2)
> > > > > > unknown
> > > > > > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > > > > > >
> > > > > > > > BTW I think it still makes sense to use an umbrella exception
> > in
> > > > case
> > > > > > in
> > > > > > > > the future we add more non-fatal cases even though today we
> > only
> > > > > have a
> > > > > > > > few.
> > > > > > > >
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > > > > reluctanthero104@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks Jason, I agree with the proposed solution here, will
> > > > update
> > > > > > the
> > > > > > > > KIP.
> > > > > > > > >
> > > > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > > > > jason@confluent.io
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Boyang,
> > > > > > > > > >
> > > > > > > > > > It seems like a reasonable suggestion. I wonder if a flag
> > is
> > > > > > > sufficient
> > > > > > > > > > though. The current `Callback` documentation treats "fatal"
> > > > > errors
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > perspective of the individual message that was sent.
> > > > > > > > > >
> > > > > > > > > > ```
> > > > > > > > > >     *                  Non-Retriable exceptions (fatal, the
> > > > > message
> > > > > > > > will
> > > > > > > > > > never be sent):
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > However, we also have fatal errors from the perspective of
> > > the
> > > > > > > > > transaction
> > > > > > > > > > (e.g. when the producer gets fenced). Perhaps that suggests
> > > we
> > > > > need
> > > > > > > > > > something richer than a boolean flag:
> > > > > > > > > >
> > > > > > > > > > At a high level, I think the following cases are possible:
> > > > > > > > > >
> > > > > > > > > > - message rejected (e.g. message too large, invalid topic)
> > > > > > > > > > - delivery failed after retries/delivery timeout (e.g.
> > > timeout,
> > > > > crc
> > > > > > > > > error,
> > > > > > > > > > not enough replicas)
> > > > > > > > > > - transaction failed (e.g. producer fenced, invalid
> > > transaction
> > > > > > > state)
> > > > > > > > > >
> > > > > > > > > > Perhaps instead we can have a type like the following:
> > > > > > > > > >
> > > > > > > > > > class SendFailure {
> > > > > > > > > >   FailureType failureType;
> > > > > > > > > >   Exception cause;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > enum FailureType {
> > > > > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > (Not married to any of these names, just a starting point.)
> > > > > > > > > >
> > > > > > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > > > > > >
> > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > SendFailure
> > > > > > > > failure) {
> > > > > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > This would give streams and other applications enough
> > > > information
> > > > > > to
> > > > > > > > know
> > > > > > > > > > whether the message can be retried and whether the
> > > transaction
> > > > > can
> > > > > > be
> > > > > > > > > > aborted.
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > -Jason
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks Jason for the thoughts.
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > > > > > jason@confluent.io
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Boyang,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the iterations here. I think this is
> > something
> > > > we
> > > > > > > should
> > > > > > > > > > have
> > > > > > > > > > > > done a long time ago. It sounds like there are two API
> > > > > changes
> > > > > > > > here:
> > > > > > > > > > > >
> > > > > > > > > > > > 1. We are introducing the `CommitFailedException` to
> > wrap
> > > > > > > abortable
> > > > > > > > > > > > errors that are raised from `commitTransaction`. This
> > > > sounds
> > > > > > fine
> > > > > > > > to
> > > > > > > > > > me.
> > > > > > > > > > > As
> > > > > > > > > > > > far as I know, the only case we might need this is when
> > > we
> > > > > add
> > > > > > > > > support
> > > > > > > > > > to
> > > > > > > > > > > > let producers recover from coordinator timeouts. Are
> > > there
> > > > > any
> > > > > > > > > others?
> > > > > > > > > > > >
> > > > > > > > > > > > I think the purpose here is to ensure non-fatal
> > > exceptions
> > > > > are
> > > > > > > > > unified
> > > > > > > > > > > under the same
> > > > > > > > > > > exception umbrella, to make the proceeding to abort any
> > > > ongoing
> > > > > > > > > > transaction
> > > > > > > > > > > much easier.
> > > > > > > > > > > I don't think `coordinator timeouts` is the only case to
> > > > > recover
> > > > > > > > here,
> > > > > > > > > > > since we have other
> > > > > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > > > > >
> > > > > > > > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > > > > > > > `KafkaException`.
> > > > > > > > > > > > The motivation for this is less clear to me and it
> > > doesn't
> > > > > look
> > > > > > > > like
> > > > > > > > > > the
> > > > > > > > > > > > example from the KIP depends on it. My concern here is
> > > > > > > > compatibility.
> > > > > > > > > > > > Currently we have the following documentation for the
> > > > > > `Callback`
> > > > > > > > API:
> > > > > > > > > > > >
> > > > > > > > > > > > ```
> > > > > > > > > > > >      *                  Non-Retriable exceptions
> > (fatal,
> > > > the
> > > > > > > > message
> > > > > > > > > > will
> > > > > > > > > > > > never be sent):
> > > > > > > > > > > >      *
> > > > > > > > > > > >      *                  InvalidTopicException
> > > > > > > > > > > >      *                  OffsetMetadataTooLargeException
> > > > > > > > > > > >      *                  RecordBatchTooLargeException
> > > > > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > > > > >      *                  UnknownServerException
> > > > > > > > > > > >      *                  UnknownProducerIdException
> > > > > > > > > > > >      *                  InvalidProducerEpochException
> > > > > > > > > > > >      *
> > > > > > > > > > > >      *                  Retriable exceptions
> > (transient,
> > > > may
> > > > > be
> > > > > > > > > covered
> > > > > > > > > > > by
> > > > > > > > > > > > increasing #.retries):
> > > > > > > > > > > >      *
> > > > > > > > > > > >      *                  CorruptRecordException
> > > > > > > > > > > >      *                  InvalidMetadataException
> > > > > > > > > > > >      *
> > > > NotEnoughReplicasAfterAppendException
> > > > > > > > > > > >      *                  NotEnoughReplicasException
> > > > > > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > > > > > >      *                  TimeoutException
> > > > > > > > > > > >      *
> > UnknownTopicOrPartitionException
> > > > > > > > > > > > ```
> > > > > > > > > > > >
> > > > > > > > > > > > If we wrap all the retriable exceptions documented here
> > > as
> > > > > > > > > > > > `KafkaException`, wouldn't that break any error
> > handling
> > > > that
> > > > > > > users
> > > > > > > > > > might
> > > > > > > > > > > > already have? it's gonna introduce a compatibility
> > issue.
> > > > > > > > > > > >
> > > > > > > > > > > > The original intention was to simplify `send` callback
> > > > error
> > > > > > > > handling
> > > > > > > > > > by
> > > > > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > > > > we have to prepare an exhausting list of exceptions to
> > > catch
> > > > as
> > > > > > > > fatal,
> > > > > > > > > > and
> > > > > > > > > > > the same lengthy list to catch as
> > > > > > > > > > > non-fatal. It would be much easier if we got `hints` from
> > > the
> > > > > > > > callback.
> > > > > > > > > > > However,
> > > > > > > > > > > I agree there is a compatibility concern, what about
> > > > > deprecating
> > > > > > > the
> > > > > > > > > > > existing:
> > > > > > > > > > >
> > > > > > > > > > > void onCompletion(RecordMetadata metadata, Exception
> > > > exception)
> > > > > > > > > > >
> > > > > > > > > > > and replace it with:
> > > > > > > > > > >
> > > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > > Exception
> > > > > > > > exception,
> > > > > > > > > > > boolean isFatal) {
> > > > > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > to make sure new users get the benefit of understanding
> > the
> > > > > > > fatality
> > > > > > > > > > based
> > > > > > > > > > > on the info presented by the producer?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jason
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > > > > info@hiringuru.com>
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Why  we are receiving all emails kindly remove us
> > from
> > > > > > > > > > > > > dev@kafka.apache.org we don't want to receive emails
> > > > > > anymore.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> > > > wangguoz@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks Boyang, yes I think I was confused about the
> > > > > > different
> > > > > > > > > > > handling
> > > > > > > > > > > > of
> > > > > > > > > > > > > > two abortTxn calls, and now I get it was not
> > > > > intentional. I
> > > > > > > > > think I
> > > > > > > > > > > do
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > have more concerns.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the clarification Guozhang, I got your
> > > > point
> > > > > > > that
> > > > > > > > we
> > > > > > > > > > > want
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > have a consistent handling of fatal exceptions
> > > being
> > > > > > thrown
> > > > > > > > > from
> > > > > > > > > > > the
> > > > > > > > > > > > > > > abortTxn. I modified the current template to move
> > > the
> > > > > > fatal
> > > > > > > > > > > exception
> > > > > > > > > > > > > > > try-catch outside of the processing loop to make
> > > sure
> > > > > we
> > > > > > > > could
> > > > > > > > > > get
> > > > > > > > > > > a
> > > > > > > > > > > > > chance
> > > > > > > > > > > > > > > to close consumer/producer modules. Let me know
> > > what
> > > > > you
> > > > > > > > think.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > My understanding is that abortTransaction would
> > > > only
> > > > > > > throw
> > > > > > > > > when
> > > > > > > > > > > the
> > > > > > > > > > > > > > > > producer is in fatal state. For CommitFailed,
> > the
> > > > > > > producer
> > > > > > > > > > should
> > > > > > > > > > > > > still
> > > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > > in the abortable error state, so that
> > > > > abortTransaction
> > > > > > > call
> > > > > > > > > > would
> > > > > > > > > > > > not
> > > > > > > > > > > > > > > throw.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang
> > <
> > > > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > > > >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >> Or are you going to maintain some internal
> > state
> > > > > such
> > > > > > > > that,
> > > > > > > > > > the
> > > > > > > > > > > > > > > >> `abortTransaction` in the catch block would
> > > never
> > > > > > throw
> > > > > > > > > again?
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang
> > Wang <
> > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > I've also thought about this (i.e. using
> > > > > > CommitFailed
> > > > > > > > for
> > > > > > > > > > all
> > > > > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > > > > > > > (CommitFailed)
> > > > > > > > > > > > block,
> > > > > > > > > > > > > > > what
> > > > > > > > > > > > > > > >> > would happen if the
> > > > `producer.abortTransaction();`
> > > > > > > > throws
> > > > > > > > > > > again?
> > > > > > > > > > > > > > > should
> > > > > > > > > > > > > > > >> > that be captured as a fatal and cause the
> > > client
> > > > > to
> > > > > > > > close
> > > > > > > > > > > again.
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > > > > >> >         // Transaction commit failed with
> > > > > abortable
> > > > > > > > error,
> > > > > > > > > > > user
> > > > > > > > > > > > > could
> > > > > > > > > > > > > > > >> reset
> > > > > > > > > > > > > > > >> >         // the application state and resume
> > > > with a
> > > > > > new
> > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > >> > root
> > > > > > > > > > > > > > > >> >         // cause was wrapped in the thrown
> > > > > > exception.
> > > > > > > > > > > > > > > >> >
> > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > > > > >> >         }
> > > > > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > > > > >> >     }
> > > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang
> > Chen <
> > > > > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> Jason and I were discussing the new API
> > > offline
> > > > > and
> > > > > > > > > decided
> > > > > > > > > > > to
> > > > > > > > > > > > > take
> > > > > > > > > > > > > > > >> >> another
> > > > > > > > > > > > > > > >> >> approach. Firstly, the reason not to
> > invent a
> > > > new
> > > > > > API
> > > > > > > > > with
> > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > > > > >> >> flag is for compatibility consideration,
> > > since
> > > > > old
> > > > > > > EOS
> > > > > > > > > code
> > > > > > > > > > > > > would not
> > > > > > > > > > > > > > > >> know
> > > > > > > > > > > > > > > >> >> that a given transaction commit was failed
> > > > > > internally
> > > > > > > > as
> > > > > > > > > > they
> > > > > > > > > > > > > don't
> > > > > > > > > > > > > > > >> listen
> > > > > > > > > > > > > > > >> >> to the output flag. Our proposed
> > alternative
> > > > > > solution
> > > > > > > > is
> > > > > > > > > to
> > > > > > > > > > > let
> > > > > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > > > > >> >> throw CommitFailedException whenever the
> > > commit
> > > > > > > failed
> > > > > > > > > with
> > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > >> >> exception*, so that on the caller side the
> > > > > handling
> > > > > > > > logic
> > > > > > > > > > > > > becomes:
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > > > > >> >>
> > > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > > > > > > >> >>         // Transaction commit failed with
> > > > > abortable
> > > > > > > > > error,
> > > > > > > > > > > user
> > > > > > > > > > > > > could
> > > > > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > > > > >> >>         // the application state and resume
> > > > with
> > > > > a
> > > > > > > new
> > > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > > The
> > > > > > > > > > > > > > > >> >> root
> > > > > > > > > > > > > > > >> >>         // cause was wrapped in the thrown
> > > > > > exception.
> > > > > > > > > > > > > > > >> >>
> > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> This approach looks cleaner as all
> > exception
> > > > > types
> > > > > > > > other
> > > > > > > > > > than
> > > > > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > > > > >> >> will doom to be fatal, which is very easy
> > to
> > > > > adopt
> > > > > > > for
> > > > > > > > > > users.
> > > > > > > > > > > > In
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> >> meantime, we still maintain the commitTxn
> > > > > behavior
> > > > > > to
> > > > > > > > > throw
> > > > > > > > > > > > > instead
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> In addition, we decided to drop the
> > > > > recommendation
> > > > > > to
> > > > > > > > > > handle
> > > > > > > > > > > > > > > >> >> TimeoutException and leave it to the users
> > to
> > > > > make
> > > > > > > the
> > > > > > > > > > call.
> > > > > > > > > > > > The
> > > > > > > > > > > > > > > >> downside
> > > > > > > > > > > > > > > >> >> for blindly calling abortTxn upon timeout
> > is
> > > > that
> > > > > > we
> > > > > > > > > could
> > > > > > > > > > > > > result in
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > >> >> illegal state when the commit was already
> > > > > > successful
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > >> >> side. Without a good guarantee on the
> > > outcome,
> > > > > > > > > > > overcomplicating
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> Let me know your thoughts on the new
> > approach
> > > > > here,
> > > > > > > > thank
> > > > > > > > > > > you!
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang
> > > Wang
> > > > <
> > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >> > Thanks for your clarification on 2)/3),
> > > that
> > > > > > makes
> > > > > > > > > sense.
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang
> > > Chen
> > > > <
> > > > > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied
> > > > > inline.
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM
> > Guozhang
> > > > > Wang <
> > > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it
> > > > again
> > > > > > and
> > > > > > > > > have
> > > > > > > > > > > the
> > > > > > > > > > > > > > > >> following
> > > > > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if
> > > commitTxn
> > > > > does
> > > > > > > not
> > > > > > > > > > throw
> > > > > > > > > > > > any
> > > > > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > > > > >> >> > > > exception, and instead we rely on the
> > > > > > > subsequent
> > > > > > > > > > > beginTxn
> > > > > > > > > > > > > call
> > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > > > > >> >> > > > may violate some callers with a
> > pattern
> > > > > that
> > > > > > > > > relying
> > > > > > > > > > on
> > > > > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > > > > >> >> > > > succeed to make some non-rollback
> > > > > operations.
> > > > > > > For
> > > > > > > > > > > > example:
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit
> > > the
> > > > > DB
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > > > > > > > non-rollback
> > > > > > > > > > > > > operation,
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > >> >> users
> > > > > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > > > > >> >> > > > just rely on commitTxn to return
> > > without
> > > > > > error
> > > > > > > to
> > > > > > > > > > make
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > > > > >> >> > > > call. Of course we can just claim
> > this
> > > > > > pattern
> > > > > > > is
> > > > > > > > > not
> > > > > > > > > > > > > > > legitimate
> > > > > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > > > > >> >> > > > the right way of doing things, but
> > many
> > > > > users
> > > > > > > may
> > > > > > > > > > > > naturally
> > > > > > > > > > > > > > > adopt
> > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > So maybe we should still let
> > commitTxn
> > > > also
> > > > > > > throw
> > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > > > > >> >> > > > which case we would then call
> > abortTxn
> > > > > again.
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > But if we do this, the pattern
> > becomes:
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > > > > >> >> > > >     } catch (...) {        //
> > enumerate
> > > > all
> > > > > > > fatal
> > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > Thank makes me think, the alternative
> > > we
> > > > > have
> > > > > > > > > > discussed
> > > > > > > > > > > > > offline
> > > > > > > > > > > > > > > >> may
> > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > >> >> > > > better: let commitTxn() to return a
> > > > boolean
> > > > > > > flag.
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > * If it returns true, it means the
> > > commit
> > > > > > > > > succeeded.
> > > > > > > > > > > > Users
> > > > > > > > > > > > > can
> > > > > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > > > > >> >> > > > continue and do any external
> > > non-rollback
> > > > > > > > > operations
> > > > > > > > > > if
> > > > > > > > > > > > > they
> > > > > > > > > > > > > > > >> like.
> > > > > > > > > > > > > > > >> >> > > > * If it returns false, it means the
> > > > commit
> > > > > > > failed
> > > > > > > > > > with
> > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > >> >> error
> > > > > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > > > > >> >> > > > the txn has been aborted*. Users do
> > not
> > > > > need
> > > > > > to
> > > > > > > > > call
> > > > > > > > > > > > > abortTxn
> > > > > > > > > > > > > > > >> again.
> > > > > > > > > > > > > > > >> >> > > > * If it throws, then it means fatal
> > > > errors.
> > > > > > > Users
> > > > > > > > > > > should
> > > > > > > > > > > > > shut
> > > > > > > > > > > > > > > >> down
> > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > > > > >> >> > > >         // reset offsets, rollback
> > > > > > operations,
> > > > > > > > etc
> > > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > > > > >> >> > > >    // do other non-rollbackable
> > things
> > > > > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > > > > >> >> > > >    // reset offsets, rollback
> > > operations,
> > > > > etc
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > Of course, if we want to have better
> > > > > > visibility
> > > > > > > > > into
> > > > > > > > > > > what
> > > > > > > > > > > > > > > caused
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > > > > >> >> > > > to fail and txn to abort. We can let
> > > the
> > > > > > return
> > > > > > > > > type
> > > > > > > > > > be
> > > > > > > > > > > > an
> > > > > > > > > > > > > enum
> > > > > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > >> >> > > > a flag. But my main idea is to still
> > > let
> > > > > the
> > > > > > > > > > commitTxn
> > > > > > > > > > > be
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> final
> > > > > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > > > > >> >> > > > users can tell whether this txn
> > > succeeded
> > > > > or
> > > > > > > not,
> > > > > > > > > > > instead
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > I agree that adding a boolean flag is
> > > > > indeed
> > > > > > > > useful
> > > > > > > > > > in
> > > > > > > > > > > > this
> > > > > > > > > > > > > > > case.
> > > > > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior
> > to
> > > > > throw
> > > > > > > > fatal
> > > > > > > > > > > > > exception
> > > > > > > > > > > > > > > in
> > > > > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > >> >> > > > sure what do you mean by "throw"
> > here.
> > > > Are
> > > > > > you
> > > > > > > > > > > proposing
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal
> > > > exceptions
> > > > > > when
> > > > > > > > > > > triggered
> > > > > > > > > > > > > > > without
> > > > > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the
> > benefit
> > > is
> > > > > to
> > > > > > > make
> > > > > > > > > the
> > > > > > > > > > > end
> > > > > > > > > > > > > > > user's
> > > > > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the
> > claim
> > > > > > > regarding
> > > > > > > > > the
> > > > > > > > > > > > > callback
> > > > > > > > > > > > > > > >> that
> > > > > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > > > > >> >> > > > setup, it is not required to handle
> > the
> > > > > > > > exception".
> > > > > > > > > > Are
> > > > > > > > > > > > you
> > > > > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to
> > > handle
> > > > > any
> > > > > > > > > > exceptions
> > > > > > > > > > > > if
> > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > is
> > > > > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > > > > >> >> > > > in the callback anymore, but just let
> > > > > > > commitTxn()
> > > > > > > > > > > itself
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > > >> >> > > > notified about the problem?
> > Personally
> > > I
> > > > > > think
> > > > > > > in
> > > > > > > > > > > Streams
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > >> should
> > > > > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > > > > >> >> > > > make the handling logic of the
> > callback
> > > > to
> > > > > be
> > > > > > > > > > > consistent
> > > > > > > > > > > > > > > >> regardless
> > > > > > > > > > > > > > > >> >> of
> > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > >> >> > > > EOS settings (today we have different
> > > > logic
> > > > > > > > > depending
> > > > > > > > > > > on
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > >> logic,
> > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > My idea originates from the claim on
> > > send
> > > > > > API:
> > > > > > > > > > > > > > > >> >> > > "When used as part of a transaction, it
> > > is
> > > > > not
> > > > > > > > > > necessary
> > > > > > > > > > > to
> > > > > > > > > > > > > > > define
> > > > > > > > > > > > > > > >> a
> > > > > > > > > > > > > > > >> >> > > callback or check the result of the
> > > future
> > > > > in
> > > > > > > > order
> > > > > > > > > to
> > > > > > > > > > > > > detect
> > > > > > > > > > > > > > > >> errors
> > > > > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > > > > >> >> > > My understanding is that for EOS, the
> > > > > exception
> > > > > > > > will
> > > > > > > > > be
> > > > > > > > > > > > > detected
> > > > > > > > > > > > > > > by
> > > > > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > > > > >> >> > > transactional APIs either way, so it
> > is a
> > > > > > > duplicate
> > > > > > > > > > > > handling
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > >> track
> > > > > > > > > > > > > > > >> >> > > all the sendExceptions in
> > > RecordCollector.
> > > > > > > > However, I
> > > > > > > > > > > > looked
> > > > > > > > > > > > > up
> > > > > > > > > > > > > > > >> >> > > sendException is being used today as
> > > > follow:
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler"
> > > for
> > > > > any
> > > > > > > > > default
> > > > > > > > > > > or
> > > > > > > > > > > > > > > >> customized
> > > > > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > > > > >> >> > > 2. Stop collecting offset info or new
> > > > > > exceptions
> > > > > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in
> > > close(),
> > > > > > > flush()
> > > > > > > > > or
> > > > > > > > > > > new
> > > > > > > > > > > > > send()
> > > > > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > To my understanding, we should still
> > > honor
> > > > > the
> > > > > > > > > > commitment
> > > > > > > > > > > > to
> > > > > > > > > > > > > #1
> > > > > > > > > > > > > > > for
> > > > > > > > > > > > > > > >> >> any
> > > > > > > > > > > > > > > >> >> > > user customized implementation. The #2
> > > does
> > > > > not
> > > > > > > > > really
> > > > > > > > > > > > > affect our
> > > > > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable
> > > as
> > > > it
> > > > > > > could
> > > > > > > > > > help
> > > > > > > > > > > us
> > > > > > > > > > > > > fail
> > > > > > > > > > > > > > > >> fast
> > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > >> >> > > new send() instead of waiting to later
> > > > stage
> > > > > of
> > > > > > > > > > > processing.
> > > > > > > > > > > > > In
> > > > > > > > > > > > > > > that
> > > > > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > > > > >> >> > > I agree we should continue to record
> > send
> > > > > > > > exceptions
> > > > > > > > > > even
> > > > > > > > > > > > > under
> > > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > > >> >> case
> > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > >> >> > > ensure the strength of stream side
> > > Producer
> > > > > > > logic.
> > > > > > > > On
> > > > > > > > > > the
> > > > > > > > > > > > > safer
> > > > > > > > > > > > > > > >> side,
> > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > > > > >> >> > > longer need to wrap certain fatal
> > > > exceptions
> > > > > > like
> > > > > > > > > > > > > ProducerFenced
> > > > > > > > > > > > > > > as
> > > > > > > > > > > > > > > >> >> > > TaskMigrated, since they should not
> > crash
> > > > the
> > > > > > > > stream
> > > > > > > > > > > thread
> > > > > > > > > > > > > if
> > > > > > > > > > > > > > > >> thrown
> > > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > > >> >> > > raw format, once we adopt the new
> > > > processing
> > > > > > > model
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > > send
> > > > > > > > > > > > > > > >> phase.
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM
> > Boyang
> > > > > Chen <
> > > > > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so
> > > far.
> > > > I
> > > > > > have
> > > > > > > > > > > polished
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > KIP
> > > > > > > > > > > > > > > >> >> after
> > > > > > > > > > > > > > > >> >> > > > > offline discussion with some folks
> > > > > working
> > > > > > on
> > > > > > > > EOS
> > > > > > > > > > to
> > > > > > > > > > > > > make the
> > > > > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > > > > >> >> > > > > handling more lightweight. The
> > > > essential
> > > > > > > change
> > > > > > > > > is
> > > > > > > > > > > that
> > > > > > > > > > > > > we
> > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > >> not
> > > > > > > > > > > > > > > >> >> > > > > inventing a new intermediate
> > > exception
> > > > > > type,
> > > > > > > > but
> > > > > > > > > > > > instead
> > > > > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > > > This way for any exception thrown
> > > from
> > > > > > phase
> > > > > > > 1,
> > > > > > > > > > will
> > > > > > > > > > > be
> > > > > > > > > > > > > an
> > > > > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > > > > >> >> > > > > phase 2 whether we should do commit
> > > or
> > > > > > abort,
> > > > > > > > and
> > > > > > > > > > > from
> > > > > > > > > > > > > now on
> > > > > > > > > > > > > > > >> >> > > > > `commitTransaction` should only
> > throw
> > > > > fatal
> > > > > > > > > > > exceptions,
> > > > > > > > > > > > > > > >> similar to
> > > > > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > > > > > > KafkaException
> > > > > > > > > > caught
> > > > > > > > > > > > in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > > > > >> >> > > > > will be definitely fatal to crash
> > the
> > > > > app.
> > > > > > > For
> > > > > > > > > more
> > > > > > > > > > > > > advanced
> > > > > > > > > > > > > > > >> users
> > > > > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > >> >> > > > > Streams, we have the ability to
> > > further
> > > > > > wrap
> > > > > > > > > > selected
> > > > > > > > > > > > > types
> > > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > >> >> > > > > exceptions to trigger task
> > migration
> > > if
> > > > > > > > > necessary.
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > > > More details in the KIP, feel free
> > to
> > > > > take
> > > > > > > > > another
> > > > > > > > > > > > look,
> > > > > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM
> > > Boyang
> > > > > > Chen <
> > > > > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM
> > > Bruno
> > > > > > > Cadonna
> > > > > > > > <
> > > > > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not
> > know
> > > > the
> > > > > > > > producer
> > > > > > > > > > > > > internal
> > > > > > > > > > > > > > > well
> > > > > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > > > > >> >> > > > > >> comment on the categorization.
> > > > > However,
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > > > > having a
> > > > > > > > > > > > > > > >> super
> > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that
> > > > > encodes
> > > > > > if
> > > > > > > > an
> > > > > > > > > > > > > exception is
> > > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > > > > >> >> > > > > >> is cleaner, better
> > understandable
> > > > and
> > > > > > less
> > > > > > > > > > > > > error-prone,
> > > > > > > > > > > > > > > >> because
> > > > > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal
> > > > exception
> > > > > > in
> > > > > > > > > future
> > > > > > > > > > > you
> > > > > > > > > > > > > just
> > > > > > > > > > > > > > > >> need
> > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > >> >> > > > > >> about letting it inherit from
> > the
> > > > > super
> > > > > > > > > > exception
> > > > > > > > > > > > and
> > > > > > > > > > > > > all
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > > >> >> > > > > >> the code will just behave
> > > correctly
> > > > > > > without
> > > > > > > > > the
> > > > > > > > > > > need
> > > > > > > > > > > > > to
> > > > > > > > > > > > > > > wrap
> > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > > > > >> >> > > > > >> exception into another exception
> > > > each
> > > > > > time
> > > > > > > > it
> > > > > > > > > is
> > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > > > > >> >> it
> > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > >> >> > > > > >> thrown at different location in
> > > the
> > > > > > code).
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> As far as I understand the
> > > following
> > > > > > > > statement
> > > > > > > > > > > from
> > > > > > > > > > > > > your
> > > > > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > > > > >> >> > > > > >> is the reason that currently
> > such
> > > a
> > > > > > super
> > > > > > > > > > > exception
> > > > > > > > > > > > > is not
> > > > > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> "Right now we have
> > > > RetriableException
> > > > > > > type,
> > > > > > > > if
> > > > > > > > > > we
> > > > > > > > > > > > are
> > > > > > > > > > > > > > > going
> > > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException`
> > type,
> > > > we
> > > > > > have
> > > > > > > > to
> > > > > > > > > > put
> > > > > > > > > > > > > this new
> > > > > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > >> >> > > > > >> the parent of the
> > > > RetriableException,
> > > > > > > > because
> > > > > > > > > > not
> > > > > > > > > > > > all
> > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in
> > > > general
> > > > > > for
> > > > > > > > > > > producer"
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> In the list of exceptions in
> > your
> > > > > KIP, I
> > > > > > > > found
> > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > > > > RetriableException.
> > > > > > I
> > > > > > > > > guess
> > > > > > > > > > > > those
> > > > > > > > > > > > > are
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > > > > >> >> > > > > >> are referring to in your
> > > statement:
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> All of those exceptions are
> > > > non-fatal
> > > > > > and
> > > > > > > do
> > > > > > > > > not
> > > > > > > > > > > > > inherit
> > > > > > > > > > > > > > > >> from
> > > > > > > > > > > > > > > >> >> > > > > >> RetriableException. Is there a
> > > > reason
> > > > > > for
> > > > > > > > > that?
> > > > > > > > > > If
> > > > > > > > > > > > > they
> > > > > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > > > > >> >> > > > > >> RetriableException we would be a
> > > bit
> > > > > > > closer
> > > > > > > > > to a
> > > > > > > > > > > > super
> > > > > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> The reason is that sender may
> > > catch
> > > > > > those
> > > > > > > > > > > exceptions
> > > > > > > > > > > > > in
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently
> > > > does
> > > > > > > > infinite
> > > > > > > > > > > > > > > >> >> > > > > > retries on RetriableException. To
> > > > make
> > > > > > sure
> > > > > > > > we
> > > > > > > > > > > could
> > > > > > > > > > > > > > > trigger
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> > > > > > abortTransaction() by catching
> > > > > non-fatal
> > > > > > > > thrown
> > > > > > > > > > > > > > > >> >> > > > > > exceptions and reinitialize the
> > txn
> > > > > > state,
> > > > > > > we
> > > > > > > > > > chose
> > > > > > > > > > > > > not to
> > > > > > > > > > > > > > > >> let
> > > > > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > > > > >> >> > > > > > exceptions inherit
> > > RetriableException
> > > > > so
> > > > > > > that
> > > > > > > > > > > > > > > >> >> > > > > > they won't cause infinite retry
> > on
> > > > > > sender.
> > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException
> > > and
> > > > > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > > >> >> > > > > >> to understand that their
> > fatality
> > > > > > depends
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > type
> > > > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > > > >> >> > > > > >> configuration) of the producer.
> > > That
> > > > > > makes
> > > > > > > > it
> > > > > > > > > > > > > difficult to
> > > > > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > > > > >> >> > > > > >> exception that encodes the
> > > > > retriability
> > > > > > as
> > > > > > > > > > > mentioned
> > > > > > > > > > > > > > > above.
> > > > > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > > > > >> >> > > > > >> possible to introduce exceptions
> > > > that
> > > > > > > > inherit
> > > > > > > > > > from
> > > > > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > > > > >> >> > > > > >> and that are thrown when those
> > > > > > exceptions
> > > > > > > > are
> > > > > > > > > > > caught
> > > > > > > > > > > > > from
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > > > > >> >> > > > > >> and the type of the producer is
> > > such
> > > > > > that
> > > > > > > > the
> > > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > > are
> > > > > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> Yea, I think in general the
> > > > exception
> > > > > > type
> > > > > > > > > > mixing
> > > > > > > > > > > is
> > > > > > > > > > > > > > > >> difficult
> > > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > > > > >> >> > > > > > them all right. I have already
> > > > proposed
> > > > > > > > another
> > > > > > > > > > > > > solution
> > > > > > > > > > > > > > > >> based
> > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > > > > >> >> > > > > > offline discussion with some
> > folks
> > > > > > working
> > > > > > > on
> > > > > > > > > EOS
> > > > > > > > > > > > > > > >> >> > > > > > to make the handling more
> > > > > straightforward
> > > > > > > for
> > > > > > > > > end
> > > > > > > > > > > > users
> > > > > > > > > > > > > > > >> without
> > > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > > > > >> >> > > > > > to distinguish exception
> > fatality.
> > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang
> > > > > wrote:
> > > > > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the
> > proposal!
> > > I
> > > > > > made a
> > > > > > > > > pass
> > > > > > > > > > > over
> > > > > > > > > > > > > the
> > > > > > > > > > > > > > > >> list
> > > > > > > > > > > > > > > >> >> and
> > > > > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not part
> > of
> > > > the
> > > > > > > public
> > > > > > > > > > API,
> > > > > > > > > > > I
> > > > > > > > > > > > > think
> > > > > > > > > > > > > > > we
> > > > > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > > > > >> >> > > > > >> > that the suggested pattern
> > (i.e.
> > > > > user
> > > > > > > can
> > > > > > > > > > always
> > > > > > > > > > > > > call
> > > > > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are
> > > > > indeed
> > > > > > > > > > supported.
> > > > > > > > > > > > > E.g. if
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > > > > >> >> > > > > >> > aborted by the broker side,
> > then
> > > > > users
> > > > > > > can
> > > > > > > > > > still
> > > > > > > > > > > > > call
> > > > > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > > > > >> >> > > > > >> > not throw another exception
> > but
> > > > just
> > > > > > be
> > > > > > > > > > treated
> > > > > > > > > > > > as a
> > > > > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > 1)
> > > > > *ConcurrentTransactionsException*:
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > > this
> > > > > > > > > > > > > error
> > > > > > > > > > > > > > > >> can
> > > > > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > > > > >> >> > > > > >> > returned but not documented
> > yet.
> > > > > This
> > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > > a
> > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*:
> > > > this
> > > > > > > error
> > > > > > > > is
> > > > > > > > > > > > > returned
> > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > > > > >> >> > > > > >> > state transition failed (e.g.
> > it
> > > > is
> > > > > > > trying
> > > > > > > > > to
> > > > > > > > > > > > > transit to
> > > > > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > > > > > > > prepare-commit).
> > > > > > > > > > > > This
> > > > > > > > > > > > > > > error
> > > > > > > > > > > > > > > >> >> could
> > > > > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > > > > >> >> > > > > >> > a bug on the client internal
> > > code
> > > > or
> > > > > > the
> > > > > > > > > > broker
> > > > > > > > > > > > > code,
> > > > > > > > > > > > > > > OR a
> > > > > > > > > > > > > > > >> >> user
> > > > > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > > > > i.e.
> > > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > > > > >> >> > > > > >> > these exceptions would only be
> > > > > > returned
> > > > > > > if
> > > > > > > > > > users
> > > > > > > > > > > > > try to
> > > > > > > > > > > > > > > do
> > > > > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right
> > > after
> > > > a
> > > > > > > > > commitTxn,
> > > > > > > > > > > > etc.
> > > > > > > > > > > > > So
> > > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal error
> > > > instead
> > > > > > of a
> > > > > > > > > fatal
> > > > > > > > > > > > > error,
> > > > > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i
> > > > > "indicates
> > > > > > > > fatal
> > > > > > > > > > > > > > > transactional
> > > > > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit
> > > > conflicting
> > > > > > with
> > > > > > > > the
> > > > > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I
> > guess
> > > > > your
> > > > > > > > > proposal
> > > > > > > > > > > is
> > > > > > > > > > > > > that
> > > > > > > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException
> > > would
> > > > be
> > > > > > > > treated
> > > > > > > > > > > > either
> > > > > > > > > > > > > as
> > > > > > > > > > > > > > > >> fatal
> > > > > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > > > > >> >> > > > > >> > transactional producer, or
> > > > non-fatal
> > > > > > > with
> > > > > > > > > > > > idempotent
> > > > > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > > >> >> > > > > >> > right? If the producer is only
> > > > > > > configured
> > > > > > > > > with
> > > > > > > > > > > > > > > idempotency
> > > > > > > > > > > > > > > >> >> but
> > > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be
> > > > confusing
> > > > > > > since
> > > > > > > > > > users
> > > > > > > > > > > > are
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > >> using
> > > > > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we
> > always
> > > > > throw
> > > > > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the
> > > > > > producer
> > > > > > > is
> > > > > > > > > > > > > configured,
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > >> >> let
> > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > > > > >> >> > > > > >> > decide how to handle it based
> > on
> > > > > > whether
> > > > > > > > it
> > > > > > > > > is
> > > > > > > > > > > > only
> > > > > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs,
> > the
> > > > > > > `send()`
> > > > > > > > > > > > callback /
> > > > > > > > > > > > > > > >> future
> > > > > > > > > > > > > > > >> >> can
> > > > > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I
> > think
> > > > this
> > > > > > KIP
> > > > > > > > > > should
> > > > > > > > > > > > also
> > > > > > > > > > > > > > > cover
> > > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2)
> > > above:
> > > > > > > > sometimes
> > > > > > > > > > > those
> > > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or
> > InvalidTxnState
> > > > are
> > > > > > not
> > > > > > > > due
> > > > > > > > > > to
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > state
> > > > > > > > > > > > > > > >> >> being
> > > > > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe
> > > > users
> > > > > > are
> > > > > > > > > doing
> > > > > > > > > > > > > something
> > > > > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > > > > >> >> > > > > >> > wondering if we should further
> > > > > > > distinguish
> > > > > > > > > > those
> > > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > > > > >> >> > > > > >> > a) those that are caused by
> > > Kafka
> > > > > > > itself,
> > > > > > > > > > e.g. a
> > > > > > > > > > > > > broker
> > > > > > > > > > > > > > > >> timed
> > > > > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an
> > > endTxn
> > > > > > > request
> > > > > > > > is
> > > > > > > > > > > > > received,
> > > > > > > > > > > > > > > >> and b)
> > > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect,
> > > > > causing
> > > > > > > the
> > > > > > > > > > > request
> > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > > > > >> >> > > > > >> > me more like for case a), but
> > > not
> > > > > case
> > > > > > > b).
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM
> > > > > Boyang
> > > > > > > > Chen <
> > > > > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your
> > > > > > proposal
> > > > > > > > > makes
> > > > > > > > > > > > sense
> > > > > > > > > > > > > as
> > > > > > > > > > > > > > > >> well,
> > > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > > > > >> >> > > > > >> >> we could have a universally
> > > > agreed
> > > > > > > > > exception
> > > > > > > > > > > type
> > > > > > > > > > > > > to be
> > > > > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > > > > >> >> > > > > >> >> without having an extra layer
> > > on
> > > > > top
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > > actual
> > > > > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy
> > will
> > > > be
> > > > > > more
> > > > > > > > > > > complex.
> > > > > > > > > > > > > Right
> > > > > > > > > > > > > > > >> now
> > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > > > > >> >> > > > > >> >> RetriableException type, if
> > we
> > > > are
> > > > > > > going
> > > > > > > > to
> > > > > > > > > > > add a
> > > > > > > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException`
> > > > type,
> > > > > we
> > > > > > > > have
> > > > > > > > > to
> > > > > > > > > > > put
> > > > > > > > > > > > > this
> > > > > > > > > > > > > > > >> new
> > > > > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > > > > >> >> > > > > >> >> parent of the
> > > RetriableException,
> > > > > > > because
> > > > > > > > > not
> > > > > > > > > > > all
> > > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in
> > > > > general
> > > > > > > for
> > > > > > > > > > > > > producer, for
> > > > > > > > > > > > > > > >> >> example
> > > > > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > > > > >> >> > > > > >> >> We could have an empty
> > > interface
> > > > > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions
> > implement
> > > > for
> > > > > > > sure,
> > > > > > > > > but
> > > > > > > > > > > > it's
> > > > > > > > > > > > > a
> > > > > > > > > > > > > > > bit
> > > > > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions
> > in
> > > > > > general.
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we
> > > > throw a
> > > > > > > > > > > > KafkaException
> > > > > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > > > > >> >> > > > > >> >> KafkaException as either
> > fatal
> > > or
> > > > > > > > > non-fatal.
> > > > > > > > > > If
> > > > > > > > > > > > we
> > > > > > > > > > > > > use
> > > > > > > > > > > > > > > an
> > > > > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required
> > > to
> > > > > > > > implement
> > > > > > > > > > > > another
> > > > > > > > > > > > > > > >> bloated
> > > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > >> >> > > > > >> >> which could replace
> > > > KafkaException
> > > > > > > type,
> > > > > > > > as
> > > > > > > > > > we
> > > > > > > > > > > > > couldn't
> > > > > > > > > > > > > > > >> mark
> > > > > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the
> > > encapsulation,
> > > > > > > > wrapping
> > > > > > > > > > > means
> > > > > > > > > > > > we
> > > > > > > > > > > > > > > could
> > > > > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > > > > >> >> > > > > >> >> of affection to the producer
> > > > only,
> > > > > > > which
> > > > > > > > is
> > > > > > > > > > > > > important
> > > > > > > > > > > > > > > >> since
> > > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > > > > >> >> > > > > >> >> shared exception types to
> > > > > implement a
> > > > > > > > > > > > > producer-related
> > > > > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > > > > >> >> > > > > >> >> as
> > > > > UnknownTopicOrPartitionException.
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38
> > PM
> > > > > > Matthias
> > > > > > > > J.
> > > > > > > > > > Sax
> > > > > > > > > > > <
> > > > > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing
> > > exceptions
> > > > > > > makes a
> > > > > > > > > lot
> > > > > > > > > > > of
> > > > > > > > > > > > > sense.
> > > > > > > > > > > > > > > >> As I
> > > > > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > > > > >> >> > > > > >> >>> the producer internals well
> > > > > enough,
> > > > > > I
> > > > > > > > > cannot
> > > > > > > > > > > > > comment
> > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > > >> >> > > > > >> >>> categorization in detail
> > > though.
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if
> > we
> > > > > should
> > > > > > > > > > introduce
> > > > > > > > > > > > an
> > > > > > > > > > > > > > > >> exception
> > > > > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception
> > would
> > > > > > > implement
> > > > > > > > > > > instead
> > > > > > > > > > > > > of
> > > > > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal
> > > > > exceptions?
> > > > > > > > What
> > > > > > > > > > > would
> > > > > > > > > > > > > be the
> > > > > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang
> > > Chen
> > > > > > > wrote:
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a
> > > > > discussion
> > > > > > > > thread
> > > > > > > > > > for
> > > > > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to
> > simplify
> > > > the
> > > > > > > > > exception
> > > > > > > > > > > > > handling
> > > > > > > > > > > > > > > >> logic
> > > > > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer
> > users
> > > by
> > > > > > > > > classifying
> > > > > > > > > > > > > fatal and
> > > > > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> and throw them
> > > correspondingly
> > > > > for
> > > > > > > > easier
> > > > > > > > > > > catch
> > > > > > > > > > > > > and
> > > > > > > > > > > > > > > >> retry.
> > > > > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >> > --
> > > > > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > >
> >
> 

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
This is a great proposal Jason, I already integrated KIP-691 with KIP-706
template to provide a new ProduceFailedException as well as a new
Producer#produce API with CompletionStage as old send API replacement with
a richer return type. Let me know if you are good with this, and whether
Chia-Ping is also happy here :)

Best,
Boyang

On Fri, Jan 29, 2021 at 4:23 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> Ok, if we are going to ask all users to update their code anyway, maybe we
> can get more bang out of this.
>
> One of the remaining problems is that a user will only get the richer
> information if they are supplying a Callback. If they wait on the future,
> then they will get the usual exception. Our hands are somewhat tied here
> because of compatibility. One thing we have wanted to do for a long time is
> introduce a new send method which returns
> `CompletableFuture<RecordMetadata>` instead. If we did so, then we wouldn't
> need to change `Callback` at all. Instead we could make the contract of the
> new api that it will always return `SendFailedException`, which could
> provide a method to get the failure type.
>
> I think Chia-Ping has been thinking about this:
> https://issues.apache.org/jira/browse/KAFKA-12227. Is there a way to
> consolidate our efforts in order to avoid unnecessary noise for users? It
> would be awesome to see a more modern API which also addresses the
> confusing error handling.
>
> -Jason
>
>
>
>
>
>
>
> On Fri, Jan 29, 2021 at 3:48 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Thanks Jason, my hope is that we could streamline the callback function
> > signature, so that in the next major release (3.0),
> > onCompletion(RecordMetadata metadata, SendFailure sendFailure) will
> become
> > the only API to be implemented.
> >
> > On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Boyang,
> > >
> > > Do you think it's necessary to deprecate the old `onCompletion`
> > callback? I
> > > was thinking there's probably no harm leaving it around. Users might
> not
> > > care about the failure type. Other than that, it looks good to me.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > the CommitFailedException wrapping would cover all the non-fatal
> > > exceptions
> > > > as we listed out in the KIP, generally speaking any exception that
> > could
> > > > recover safely by calling abortTxn should be wrapped.
> > > >
> > > > Best,
> > > > Boyang
> > > >
> > > > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Hey Boyang,
> > > > >
> > > > > I think maybe there's something cracking here :) I'm just asking
> for
> > > > > clarifications that as of today, which non-fatal exceptions the
> newly
> > > > > introduced CommitFailedException would cover, and it seems to be
> only
> > > 1)
> > > > > unknown pid, 2) invalid pid mapping, and 3) concurrent
> transactions.
> > Is
> > > > > that correct?
> > > > >
> > > > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Hey Guozhang, I think TimeoutException would not be covered here
> as
> > > it
> > > > > > potentially has a risk of hitting an illegal state on the broker
> > side
> > > > > when
> > > > > > the previous commit was actually successful. Users should try to
> > > > increase
> > > > > > their max.block.ms to avoid hitting the timeout as a base
> > > suggestion,
> > > > > > which
> > > > > > is discussed in the KIP.
> > > > > >
> > > > > > As for the umbrella exception, I agree it has values to some
> > extent,
> > > > but
> > > > > in
> > > > > > terms of adoption difficulty, additional exception types usually
> > make
> > > > the
> > > > > > error handling more complex than simplified, and we are doing our
> > > best
> > > > to
> > > > > > avoid compatibility issues caused by wrapping previously thrown
> raw
> > > > > > exceptions that invalidates user error handling. And I'm also not
> > in
> > > > > favor
> > > > > > of our exception hierarchy today where all exceptions are
> > subclasses
> > > of
> > > > > > KafkaException, and we already mixed its handling with other
> > specific
> > > > > > exception types that could be caught as superclass
> KafkaException.
> > > That
> > > > > > could be addressed in a separate proposal if we see necessary.
> > > > > >
> > > > > > Boyang
> > > > > >
> > > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Thanks Jason for the suggestion, that looks good to me too.
> > > > > > >
> > > > > > > Regarding the non-fatal exceptions wrapped as CommitFailed, I
> > would
> > > > > like
> > > > > > to
> > > > > > > clarify if we would cover all the following cases: 1) timeout,
> 2)
> > > > > unknown
> > > > > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > > > > >
> > > > > > > BTW I think it still makes sense to use an umbrella exception
> in
> > > case
> > > > > in
> > > > > > > the future we add more non-fatal cases even though today we
> only
> > > > have a
> > > > > > > few.
> > > > > > >
> > > > > > >
> > > > > > > Guozhang
> > > > > > >
> > > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Jason, I agree with the proposed solution here, will
> > > update
> > > > > the
> > > > > > > KIP.
> > > > > > > >
> > > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > > > jason@confluent.io
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Boyang,
> > > > > > > > >
> > > > > > > > > It seems like a reasonable suggestion. I wonder if a flag
> is
> > > > > > sufficient
> > > > > > > > > though. The current `Callback` documentation treats "fatal"
> > > > errors
> > > > > > from
> > > > > > > > the
> > > > > > > > > perspective of the individual message that was sent.
> > > > > > > > >
> > > > > > > > > ```
> > > > > > > > >     *                  Non-Retriable exceptions (fatal, the
> > > > message
> > > > > > > will
> > > > > > > > > never be sent):
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > However, we also have fatal errors from the perspective of
> > the
> > > > > > > > transaction
> > > > > > > > > (e.g. when the producer gets fenced). Perhaps that suggests
> > we
> > > > need
> > > > > > > > > something richer than a boolean flag:
> > > > > > > > >
> > > > > > > > > At a high level, I think the following cases are possible:
> > > > > > > > >
> > > > > > > > > - message rejected (e.g. message too large, invalid topic)
> > > > > > > > > - delivery failed after retries/delivery timeout (e.g.
> > timeout,
> > > > crc
> > > > > > > > error,
> > > > > > > > > not enough replicas)
> > > > > > > > > - transaction failed (e.g. producer fenced, invalid
> > transaction
> > > > > > state)
> > > > > > > > >
> > > > > > > > > Perhaps instead we can have a type like the following:
> > > > > > > > >
> > > > > > > > > class SendFailure {
> > > > > > > > >   FailureType failureType;
> > > > > > > > >   Exception cause;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > enum FailureType {
> > > > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > (Not married to any of these names, just a starting point.)
> > > > > > > > >
> > > > > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > > > > >
> > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > SendFailure
> > > > > > > failure) {
> > > > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > This would give streams and other applications enough
> > > information
> > > > > to
> > > > > > > know
> > > > > > > > > whether the message can be retried and whether the
> > transaction
> > > > can
> > > > > be
> > > > > > > > > aborted.
> > > > > > > > >
> > > > > > > > > What do you think?
> > > > > > > > >
> > > > > > > > > -Jason
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks Jason for the thoughts.
> > > > > > > > > >
> > > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > > > > jason@confluent.io
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Boyang,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the iterations here. I think this is
> something
> > > we
> > > > > > should
> > > > > > > > > have
> > > > > > > > > > > done a long time ago. It sounds like there are two API
> > > > changes
> > > > > > > here:
> > > > > > > > > > >
> > > > > > > > > > > 1. We are introducing the `CommitFailedException` to
> wrap
> > > > > > abortable
> > > > > > > > > > > errors that are raised from `commitTransaction`. This
> > > sounds
> > > > > fine
> > > > > > > to
> > > > > > > > > me.
> > > > > > > > > > As
> > > > > > > > > > > far as I know, the only case we might need this is when
> > we
> > > > add
> > > > > > > > support
> > > > > > > > > to
> > > > > > > > > > > let producers recover from coordinator timeouts. Are
> > there
> > > > any
> > > > > > > > others?
> > > > > > > > > > >
> > > > > > > > > > > I think the purpose here is to ensure non-fatal
> > exceptions
> > > > are
> > > > > > > > unified
> > > > > > > > > > under the same
> > > > > > > > > > exception umbrella, to make the proceeding to abort any
> > > ongoing
> > > > > > > > > transaction
> > > > > > > > > > much easier.
> > > > > > > > > > I don't think `coordinator timeouts` is the only case to
> > > > recover
> > > > > > > here,
> > > > > > > > > > since we have other
> > > > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > > > >
> > > > > > > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > > > > > > `KafkaException`.
> > > > > > > > > > > The motivation for this is less clear to me and it
> > doesn't
> > > > look
> > > > > > > like
> > > > > > > > > the
> > > > > > > > > > > example from the KIP depends on it. My concern here is
> > > > > > > compatibility.
> > > > > > > > > > > Currently we have the following documentation for the
> > > > > `Callback`
> > > > > > > API:
> > > > > > > > > > >
> > > > > > > > > > > ```
> > > > > > > > > > >      *                  Non-Retriable exceptions
> (fatal,
> > > the
> > > > > > > message
> > > > > > > > > will
> > > > > > > > > > > never be sent):
> > > > > > > > > > >      *
> > > > > > > > > > >      *                  InvalidTopicException
> > > > > > > > > > >      *                  OffsetMetadataTooLargeException
> > > > > > > > > > >      *                  RecordBatchTooLargeException
> > > > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > > > >      *                  UnknownServerException
> > > > > > > > > > >      *                  UnknownProducerIdException
> > > > > > > > > > >      *                  InvalidProducerEpochException
> > > > > > > > > > >      *
> > > > > > > > > > >      *                  Retriable exceptions
> (transient,
> > > may
> > > > be
> > > > > > > > covered
> > > > > > > > > > by
> > > > > > > > > > > increasing #.retries):
> > > > > > > > > > >      *
> > > > > > > > > > >      *                  CorruptRecordException
> > > > > > > > > > >      *                  InvalidMetadataException
> > > > > > > > > > >      *
> > > NotEnoughReplicasAfterAppendException
> > > > > > > > > > >      *                  NotEnoughReplicasException
> > > > > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > > > > >      *                  TimeoutException
> > > > > > > > > > >      *
> UnknownTopicOrPartitionException
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > If we wrap all the retriable exceptions documented here
> > as
> > > > > > > > > > > `KafkaException`, wouldn't that break any error
> handling
> > > that
> > > > > > users
> > > > > > > > > might
> > > > > > > > > > > already have? it's gonna introduce a compatibility
> issue.
> > > > > > > > > > >
> > > > > > > > > > > The original intention was to simplify `send` callback
> > > error
> > > > > > > handling
> > > > > > > > > by
> > > > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > > > we have to prepare an exhausting list of exceptions to
> > catch
> > > as
> > > > > > > fatal,
> > > > > > > > > and
> > > > > > > > > > the same lengthy list to catch as
> > > > > > > > > > non-fatal. It would be much easier if we got `hints` from
> > the
> > > > > > > callback.
> > > > > > > > > > However,
> > > > > > > > > > I agree there is a compatibility concern, what about
> > > > deprecating
> > > > > > the
> > > > > > > > > > existing:
> > > > > > > > > >
> > > > > > > > > > void onCompletion(RecordMetadata metadata, Exception
> > > exception)
> > > > > > > > > >
> > > > > > > > > > and replace it with:
> > > > > > > > > >
> > > > > > > > > > default void onCompletion(RecordMetadata metadata,
> > Exception
> > > > > > > exception,
> > > > > > > > > > boolean isFatal) {
> > > > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > to make sure new users get the benefit of understanding
> the
> > > > > > fatality
> > > > > > > > > based
> > > > > > > > > > on the info presented by the producer?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > > > info@hiringuru.com>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Why  we are receiving all emails kindly remove us
> from
> > > > > > > > > > > > dev@kafka.apache.org we don't want to receive emails
> > > > > anymore.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks
> > > > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks Boyang, yes I think I was confused about the
> > > > > different
> > > > > > > > > > handling
> > > > > > > > > > > of
> > > > > > > > > > > > > two abortTxn calls, and now I get it was not
> > > > intentional. I
> > > > > > > > think I
> > > > > > > > > > do
> > > > > > > > > > > > not
> > > > > > > > > > > > > have more concerns.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for the clarification Guozhang, I got your
> > > point
> > > > > > that
> > > > > > > we
> > > > > > > > > > want
> > > > > > > > > > > to
> > > > > > > > > > > > > > have a consistent handling of fatal exceptions
> > being
> > > > > thrown
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > > > > abortTxn. I modified the current template to move
> > the
> > > > > fatal
> > > > > > > > > > exception
> > > > > > > > > > > > > > try-catch outside of the processing loop to make
> > sure
> > > > we
> > > > > > > could
> > > > > > > > > get
> > > > > > > > > > a
> > > > > > > > > > > > chance
> > > > > > > > > > > > > > to close consumer/producer modules. Let me know
> > what
> > > > you
> > > > > > > think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best,
> > > > > > > > > > > > > > Boyang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My understanding is that abortTransaction would
> > > only
> > > > > > throw
> > > > > > > > when
> > > > > > > > > > the
> > > > > > > > > > > > > > > producer is in fatal state. For CommitFailed,
> the
> > > > > > producer
> > > > > > > > > should
> > > > > > > > > > > > still
> > > > > > > > > > > > > > be
> > > > > > > > > > > > > > > in the abortable error state, so that
> > > > abortTransaction
> > > > > > call
> > > > > > > > > would
> > > > > > > > > > > not
> > > > > > > > > > > > > > throw.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang
> <
> > > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >> Or are you going to maintain some internal
> state
> > > > such
> > > > > > > that,
> > > > > > > > > the
> > > > > > > > > > > > > > >> `abortTransaction` in the catch block would
> > never
> > > > > throw
> > > > > > > > again?
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang
> Wang <
> > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > I've also thought about this (i.e. using
> > > > > CommitFailed
> > > > > > > for
> > > > > > > > > all
> > > > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > > > > > > (CommitFailed)
> > > > > > > > > > > block,
> > > > > > > > > > > > > > what
> > > > > > > > > > > > > > >> > would happen if the
> > > `producer.abortTransaction();`
> > > > > > > throws
> > > > > > > > > > again?
> > > > > > > > > > > > > > should
> > > > > > > > > > > > > > >> > that be captured as a fatal and cause the
> > client
> > > > to
> > > > > > > close
> > > > > > > > > > again.
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > > > >> >         // Transaction commit failed with
> > > > abortable
> > > > > > > error,
> > > > > > > > > > user
> > > > > > > > > > > > could
> > > > > > > > > > > > > > >> reset
> > > > > > > > > > > > > > >> >         // the application state and resume
> > > with a
> > > > > new
> > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > >> > root
> > > > > > > > > > > > > > >> >         // cause was wrapped in the thrown
> > > > > exception.
> > > > > > > > > > > > > > >> >
> >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > > > >> >         }
> > > > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > > > >> >     }
> > > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang
> Chen <
> > > > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> Jason and I were discussing the new API
> > offline
> > > > and
> > > > > > > > decided
> > > > > > > > > > to
> > > > > > > > > > > > take
> > > > > > > > > > > > > > >> >> another
> > > > > > > > > > > > > > >> >> approach. Firstly, the reason not to
> invent a
> > > new
> > > > > API
> > > > > > > > with
> > > > > > > > > > > > returned
> > > > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > > > >> >> flag is for compatibility consideration,
> > since
> > > > old
> > > > > > EOS
> > > > > > > > code
> > > > > > > > > > > > would not
> > > > > > > > > > > > > > >> know
> > > > > > > > > > > > > > >> >> that a given transaction commit was failed
> > > > > internally
> > > > > > > as
> > > > > > > > > they
> > > > > > > > > > > > don't
> > > > > > > > > > > > > > >> listen
> > > > > > > > > > > > > > >> >> to the output flag. Our proposed
> alternative
> > > > > solution
> > > > > > > is
> > > > > > > > to
> > > > > > > > > > let
> > > > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > > > >> >> throw CommitFailedException whenever the
> > commit
> > > > > > failed
> > > > > > > > with
> > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > >> >> exception*, so that on the caller side the
> > > > handling
> > > > > > > logic
> > > > > > > > > > > > becomes:
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > > > >> >>
> > > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > > > > > >> >>         // Transaction commit failed with
> > > > abortable
> > > > > > > > error,
> > > > > > > > > > user
> > > > > > > > > > > > could
> > > > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > > > >> >>         // the application state and resume
> > > with
> > > > a
> > > > > > new
> > > > > > > > > > > > transaction.
> > > > > > > > > > > > > > The
> > > > > > > > > > > > > > >> >> root
> > > > > > > > > > > > > > >> >>         // cause was wrapped in the thrown
> > > > > exception.
> > > > > > > > > > > > > > >> >>
> > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> This approach looks cleaner as all
> exception
> > > > types
> > > > > > > other
> > > > > > > > > than
> > > > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > > > >> >> will doom to be fatal, which is very easy
> to
> > > > adopt
> > > > > > for
> > > > > > > > > users.
> > > > > > > > > > > In
> > > > > > > > > > > > the
> > > > > > > > > > > > > > >> >> meantime, we still maintain the commitTxn
> > > > behavior
> > > > > to
> > > > > > > > throw
> > > > > > > > > > > > instead
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> In addition, we decided to drop the
> > > > recommendation
> > > > > to
> > > > > > > > > handle
> > > > > > > > > > > > > > >> >> TimeoutException and leave it to the users
> to
> > > > make
> > > > > > the
> > > > > > > > > call.
> > > > > > > > > > > The
> > > > > > > > > > > > > > >> downside
> > > > > > > > > > > > > > >> >> for blindly calling abortTxn upon timeout
> is
> > > that
> > > > > we
> > > > > > > > could
> > > > > > > > > > > > result in
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > >> >> illegal state when the commit was already
> > > > > successful
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > > broker
> > > > > > > > > > > > > > >> >> side. Without a good guarantee on the
> > outcome,
> > > > > > > > > > overcomplicating
> > > > > > > > > > > > the
> > > > > > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> Let me know your thoughts on the new
> approach
> > > > here,
> > > > > > > thank
> > > > > > > > > > you!
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang
> > Wang
> > > <
> > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >> > Thanks for your clarification on 2)/3),
> > that
> > > > > makes
> > > > > > > > sense.
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang
> > Chen
> > > <
> > > > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied
> > > > inline.
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM
> Guozhang
> > > > Wang <
> > > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it
> > > again
> > > > > and
> > > > > > > > have
> > > > > > > > > > the
> > > > > > > > > > > > > > >> following
> > > > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if
> > commitTxn
> > > > does
> > > > > > not
> > > > > > > > > throw
> > > > > > > > > > > any
> > > > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > > > >> >> > > > exception, and instead we rely on the
> > > > > > subsequent
> > > > > > > > > > beginTxn
> > > > > > > > > > > > call
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > > > >> >> > > > may violate some callers with a
> pattern
> > > > that
> > > > > > > > relying
> > > > > > > > > on
> > > > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > > > >> >> > > > succeed to make some non-rollback
> > > > operations.
> > > > > > For
> > > > > > > > > > > example:
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit
> > the
> > > > DB
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > > > > > > non-rollback
> > > > > > > > > > > > operation,
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > >> >> users
> > > > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > > > >> >> > > > just rely on commitTxn to return
> > without
> > > > > error
> > > > > > to
> > > > > > > > > make
> > > > > > > > > > > this
> > > > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > > > >> >> > > > call. Of course we can just claim
> this
> > > > > pattern
> > > > > > is
> > > > > > > > not
> > > > > > > > > > > > > > legitimate
> > > > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > > > >> >> > > > the right way of doing things, but
> many
> > > > users
> > > > > > may
> > > > > > > > > > > naturally
> > > > > > > > > > > > > > adopt
> > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > So maybe we should still let
> commitTxn
> > > also
> > > > > > throw
> > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > > > >> >> > > > which case we would then call
> abortTxn
> > > > again.
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > But if we do this, the pattern
> becomes:
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > > > >> >> > > >     } catch (...) {        //
> enumerate
> > > all
> > > > > > fatal
> > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > Thank makes me think, the alternative
> > we
> > > > have
> > > > > > > > > discussed
> > > > > > > > > > > > offline
> > > > > > > > > > > > > > >> may
> > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > >> >> > > > better: let commitTxn() to return a
> > > boolean
> > > > > > flag.
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > * If it returns true, it means the
> > commit
> > > > > > > > succeeded.
> > > > > > > > > > > Users
> > > > > > > > > > > > can
> > > > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > > > >> >> > > > continue and do any external
> > non-rollback
> > > > > > > > operations
> > > > > > > > > if
> > > > > > > > > > > > they
> > > > > > > > > > > > > > >> like.
> > > > > > > > > > > > > > >> >> > > > * If it returns false, it means the
> > > commit
> > > > > > failed
> > > > > > > > > with
> > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > >> >> error
> > > > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > > > >> >> > > > the txn has been aborted*. Users do
> not
> > > > need
> > > > > to
> > > > > > > > call
> > > > > > > > > > > > abortTxn
> > > > > > > > > > > > > > >> again.
> > > > > > > > > > > > > > >> >> > > > * If it throws, then it means fatal
> > > errors.
> > > > > > Users
> > > > > > > > > > should
> > > > > > > > > > > > shut
> > > > > > > > > > > > > > >> down
> > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > > > >> >> > > >         // reset offsets, rollback
> > > > > operations,
> > > > > > > etc
> > > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > > > >> >> > > >    // do other non-rollbackable
> things
> > > > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > > > >> >> > > >    // reset offsets, rollback
> > operations,
> > > > etc
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > Of course, if we want to have better
> > > > > visibility
> > > > > > > > into
> > > > > > > > > > what
> > > > > > > > > > > > > > caused
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > > > >> >> > > > to fail and txn to abort. We can let
> > the
> > > > > return
> > > > > > > > type
> > > > > > > > > be
> > > > > > > > > > > an
> > > > > > > > > > > > enum
> > > > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > >> >> > > > a flag. But my main idea is to still
> > let
> > > > the
> > > > > > > > > commitTxn
> > > > > > > > > > be
> > > > > > > > > > > > the
> > > > > > > > > > > > > > >> final
> > > > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > > > >> >> > > > users can tell whether this txn
> > succeeded
> > > > or
> > > > > > not,
> > > > > > > > > > instead
> > > > > > > > > > > > of
> > > > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > I agree that adding a boolean flag is
> > > > indeed
> > > > > > > useful
> > > > > > > > > in
> > > > > > > > > > > this
> > > > > > > > > > > > > > case.
> > > > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior
> to
> > > > throw
> > > > > > > fatal
> > > > > > > > > > > > exception
> > > > > > > > > > > > > > in
> > > > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > >> >> > > > sure what do you mean by "throw"
> here.
> > > Are
> > > > > you
> > > > > > > > > > proposing
> > > > > > > > > > > > the
> > > > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal
> > > exceptions
> > > > > when
> > > > > > > > > > triggered
> > > > > > > > > > > > > > without
> > > > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the
> benefit
> > is
> > > > to
> > > > > > make
> > > > > > > > the
> > > > > > > > > > end
> > > > > > > > > > > > > > user's
> > > > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the
> claim
> > > > > > regarding
> > > > > > > > the
> > > > > > > > > > > > callback
> > > > > > > > > > > > > > >> that
> > > > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > > > >> >> > > > setup, it is not required to handle
> the
> > > > > > > exception".
> > > > > > > > > Are
> > > > > > > > > > > you
> > > > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to
> > handle
> > > > any
> > > > > > > > > exceptions
> > > > > > > > > > > if
> > > > > > > > > > > > EOS
> > > > > > > > > > > > > > is
> > > > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > > > >> >> > > > in the callback anymore, but just let
> > > > > > commitTxn()
> > > > > > > > > > itself
> > > > > > > > > > > to
> > > > > > > > > > > > > > fail
> > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > > >> >> > > > notified about the problem?
> Personally
> > I
> > > > > think
> > > > > > in
> > > > > > > > > > Streams
> > > > > > > > > > > > we
> > > > > > > > > > > > > > >> should
> > > > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > > > >> >> > > > make the handling logic of the
> callback
> > > to
> > > > be
> > > > > > > > > > consistent
> > > > > > > > > > > > > > >> regardless
> > > > > > > > > > > > > > >> >> of
> > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > >> >> > > > EOS settings (today we have different
> > > logic
> > > > > > > > depending
> > > > > > > > > > on
> > > > > > > > > > > > this
> > > > > > > > > > > > > > >> logic,
> > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > My idea originates from the claim on
> > send
> > > > > API:
> > > > > > > > > > > > > > >> >> > > "When used as part of a transaction, it
> > is
> > > > not
> > > > > > > > > necessary
> > > > > > > > > > to
> > > > > > > > > > > > > > define
> > > > > > > > > > > > > > >> a
> > > > > > > > > > > > > > >> >> > > callback or check the result of the
> > future
> > > > in
> > > > > > > order
> > > > > > > > to
> > > > > > > > > > > > detect
> > > > > > > > > > > > > > >> errors
> > > > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > > > >> >> > > My understanding is that for EOS, the
> > > > exception
> > > > > > > will
> > > > > > > > be
> > > > > > > > > > > > detected
> > > > > > > > > > > > > > by
> > > > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > > > >> >> > > transactional APIs either way, so it
> is a
> > > > > > duplicate
> > > > > > > > > > > handling
> > > > > > > > > > > > to
> > > > > > > > > > > > > > >> track
> > > > > > > > > > > > > > >> >> > > all the sendExceptions in
> > RecordCollector.
> > > > > > > However, I
> > > > > > > > > > > looked
> > > > > > > > > > > > up
> > > > > > > > > > > > > > >> >> > > sendException is being used today as
> > > follow:
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler"
> > for
> > > > any
> > > > > > > > default
> > > > > > > > > > or
> > > > > > > > > > > > > > >> customized
> > > > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > > > >> >> > > 2. Stop collecting offset info or new
> > > > > exceptions
> > > > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in
> > close(),
> > > > > > flush()
> > > > > > > > or
> > > > > > > > > > new
> > > > > > > > > > > > send()
> > > > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > To my understanding, we should still
> > honor
> > > > the
> > > > > > > > > commitment
> > > > > > > > > > > to
> > > > > > > > > > > > #1
> > > > > > > > > > > > > > for
> > > > > > > > > > > > > > >> >> any
> > > > > > > > > > > > > > >> >> > > user customized implementation. The #2
> > does
> > > > not
> > > > > > > > really
> > > > > > > > > > > > affect our
> > > > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable
> > as
> > > it
> > > > > > could
> > > > > > > > > help
> > > > > > > > > > us
> > > > > > > > > > > > fail
> > > > > > > > > > > > > > >> fast
> > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > >> >> > > new send() instead of waiting to later
> > > stage
> > > > of
> > > > > > > > > > processing.
> > > > > > > > > > > > In
> > > > > > > > > > > > > > that
> > > > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > > > >> >> > > I agree we should continue to record
> send
> > > > > > > exceptions
> > > > > > > > > even
> > > > > > > > > > > > under
> > > > > > > > > > > > > > EOS
> > > > > > > > > > > > > > >> >> case
> > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > >> >> > > ensure the strength of stream side
> > Producer
> > > > > > logic.
> > > > > > > On
> > > > > > > > > the
> > > > > > > > > > > > safer
> > > > > > > > > > > > > > >> side,
> > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > > > >> >> > > longer need to wrap certain fatal
> > > exceptions
> > > > > like
> > > > > > > > > > > > ProducerFenced
> > > > > > > > > > > > > > as
> > > > > > > > > > > > > > >> >> > > TaskMigrated, since they should not
> crash
> > > the
> > > > > > > stream
> > > > > > > > > > thread
> > > > > > > > > > > > if
> > > > > > > > > > > > > > >> thrown
> > > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > > >> >> > > raw format, once we adopt the new
> > > processing
> > > > > > model
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > > send
> > > > > > > > > > > > > > >> phase.
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM
> Boyang
> > > > Chen <
> > > > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so
> > far.
> > > I
> > > > > have
> > > > > > > > > > polished
> > > > > > > > > > > > the
> > > > > > > > > > > > > > KIP
> > > > > > > > > > > > > > >> >> after
> > > > > > > > > > > > > > >> >> > > > > offline discussion with some folks
> > > > working
> > > > > on
> > > > > > > EOS
> > > > > > > > > to
> > > > > > > > > > > > make the
> > > > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > > > >> >> > > > > handling more lightweight. The
> > > essential
> > > > > > change
> > > > > > > > is
> > > > > > > > > > that
> > > > > > > > > > > > we
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > >> not
> > > > > > > > > > > > > > >> >> > > > > inventing a new intermediate
> > exception
> > > > > type,
> > > > > > > but
> > > > > > > > > > > instead
> > > > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > > > This way for any exception thrown
> > from
> > > > > phase
> > > > > > 1,
> > > > > > > > > will
> > > > > > > > > > be
> > > > > > > > > > > > an
> > > > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > > > >> >> > > > > phase 2 whether we should do commit
> > or
> > > > > abort,
> > > > > > > and
> > > > > > > > > > from
> > > > > > > > > > > > now on
> > > > > > > > > > > > > > >> >> > > > > `commitTransaction` should only
> throw
> > > > fatal
> > > > > > > > > > exceptions,
> > > > > > > > > > > > > > >> similar to
> > > > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > > > > > KafkaException
> > > > > > > > > caught
> > > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > > > >> >> > > > > will be definitely fatal to crash
> the
> > > > app.
> > > > > > For
> > > > > > > > more
> > > > > > > > > > > > advanced
> > > > > > > > > > > > > > >> users
> > > > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > >> >> > > > > Streams, we have the ability to
> > further
> > > > > wrap
> > > > > > > > > selected
> > > > > > > > > > > > types
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > >> >> > > > > exceptions to trigger task
> migration
> > if
> > > > > > > > necessary.
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > > > More details in the KIP, feel free
> to
> > > > take
> > > > > > > > another
> > > > > > > > > > > look,
> > > > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM
> > Boyang
> > > > > Chen <
> > > > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM
> > Bruno
> > > > > > Cadonna
> > > > > > > <
> > > > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not
> know
> > > the
> > > > > > > producer
> > > > > > > > > > > > internal
> > > > > > > > > > > > > > well
> > > > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > > > >> >> > > > > >> comment on the categorization.
> > > > However,
> > > > > I
> > > > > > > > think
> > > > > > > > > > > > having a
> > > > > > > > > > > > > > >> super
> > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that
> > > > encodes
> > > > > if
> > > > > > > an
> > > > > > > > > > > > exception is
> > > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > > > >> >> > > > > >> is cleaner, better
> understandable
> > > and
> > > > > less
> > > > > > > > > > > > error-prone,
> > > > > > > > > > > > > > >> because
> > > > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal
> > > exception
> > > > > in
> > > > > > > > future
> > > > > > > > > > you
> > > > > > > > > > > > just
> > > > > > > > > > > > > > >> need
> > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > >> >> > > > > >> about letting it inherit from
> the
> > > > super
> > > > > > > > > exception
> > > > > > > > > > > and
> > > > > > > > > > > > all
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > > >> >> > > > > >> the code will just behave
> > correctly
> > > > > > without
> > > > > > > > the
> > > > > > > > > > need
> > > > > > > > > > > > to
> > > > > > > > > > > > > > wrap
> > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > > > >> >> > > > > >> exception into another exception
> > > each
> > > > > time
> > > > > > > it
> > > > > > > > is
> > > > > > > > > > > > thrown
> > > > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > > > >> >> it
> > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > >> >> > > > > >> thrown at different location in
> > the
> > > > > code).
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> As far as I understand the
> > following
> > > > > > > statement
> > > > > > > > > > from
> > > > > > > > > > > > your
> > > > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > > > >> >> > > > > >> is the reason that currently
> such
> > a
> > > > > super
> > > > > > > > > > exception
> > > > > > > > > > > > is not
> > > > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> "Right now we have
> > > RetriableException
> > > > > > type,
> > > > > > > if
> > > > > > > > > we
> > > > > > > > > > > are
> > > > > > > > > > > > > > going
> > > > > > > > > > > > > > >> to
> > > > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException`
> type,
> > > we
> > > > > have
> > > > > > > to
> > > > > > > > > put
> > > > > > > > > > > > this new
> > > > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > >> >> > > > > >> the parent of the
> > > RetriableException,
> > > > > > > because
> > > > > > > > > not
> > > > > > > > > > > all
> > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in
> > > general
> > > > > for
> > > > > > > > > > producer"
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> In the list of exceptions in
> your
> > > > KIP, I
> > > > > > > found
> > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > > > RetriableException.
> > > > > I
> > > > > > > > guess
> > > > > > > > > > > those
> > > > > > > > > > > > are
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > > > >> >> > > > > >> are referring to in your
> > statement:
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> All of those exceptions are
> > > non-fatal
> > > > > and
> > > > > > do
> > > > > > > > not
> > > > > > > > > > > > inherit
> > > > > > > > > > > > > > >> from
> > > > > > > > > > > > > > >> >> > > > > >> RetriableException. Is there a
> > > reason
> > > > > for
> > > > > > > > that?
> > > > > > > > > If
> > > > > > > > > > > > they
> > > > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > > > >> >> > > > > >> RetriableException we would be a
> > bit
> > > > > > closer
> > > > > > > > to a
> > > > > > > > > > > super
> > > > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> The reason is that sender may
> > catch
> > > > > those
> > > > > > > > > > exceptions
> > > > > > > > > > > > in
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently
> > > does
> > > > > > > infinite
> > > > > > > > > > > > > > >> >> > > > > > retries on RetriableException. To
> > > make
> > > > > sure
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > > > > > trigger
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> > > > > > abortTransaction() by catching
> > > > non-fatal
> > > > > > > thrown
> > > > > > > > > > > > > > >> >> > > > > > exceptions and reinitialize the
> txn
> > > > > state,
> > > > > > we
> > > > > > > > > chose
> > > > > > > > > > > > not to
> > > > > > > > > > > > > > >> let
> > > > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > > > >> >> > > > > > exceptions inherit
> > RetriableException
> > > > so
> > > > > > that
> > > > > > > > > > > > > > >> >> > > > > > they won't cause infinite retry
> on
> > > > > sender.
> > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException
> > and
> > > > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > > >> >> > > > > >> to understand that their
> fatality
> > > > > depends
> > > > > > on
> > > > > > > > the
> > > > > > > > > > > type
> > > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > > >> >> > > > > >> configuration) of the producer.
> > That
> > > > > makes
> > > > > > > it
> > > > > > > > > > > > difficult to
> > > > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > > > >> >> > > > > >> exception that encodes the
> > > > retriability
> > > > > as
> > > > > > > > > > mentioned
> > > > > > > > > > > > > > above.
> > > > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > > > >> >> > > > > >> possible to introduce exceptions
> > > that
> > > > > > > inherit
> > > > > > > > > from
> > > > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > > > >> >> > > > > >> and that are thrown when those
> > > > > exceptions
> > > > > > > are
> > > > > > > > > > caught
> > > > > > > > > > > > from
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > > > >> >> > > > > >> and the type of the producer is
> > such
> > > > > that
> > > > > > > the
> > > > > > > > > > > > exceptions
> > > > > > > > > > > > > > are
> > > > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> Yea, I think in general the
> > > exception
> > > > > type
> > > > > > > > > mixing
> > > > > > > > > > is
> > > > > > > > > > > > > > >> difficult
> > > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > > > >> >> > > > > > them all right. I have already
> > > proposed
> > > > > > > another
> > > > > > > > > > > > solution
> > > > > > > > > > > > > > >> based
> > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > > > >> >> > > > > > offline discussion with some
> folks
> > > > > working
> > > > > > on
> > > > > > > > EOS
> > > > > > > > > > > > > > >> >> > > > > > to make the handling more
> > > > straightforward
> > > > > > for
> > > > > > > > end
> > > > > > > > > > > users
> > > > > > > > > > > > > > >> without
> > > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > > > >> >> > > > > > to distinguish exception
> fatality.
> > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang
> > > > wrote:
> > > > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the
> proposal!
> > I
> > > > > made a
> > > > > > > > pass
> > > > > > > > > > over
> > > > > > > > > > > > the
> > > > > > > > > > > > > > >> list
> > > > > > > > > > > > > > >> >> and
> > > > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not part
> of
> > > the
> > > > > > public
> > > > > > > > > API,
> > > > > > > > > > I
> > > > > > > > > > > > think
> > > > > > > > > > > > > > we
> > > > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > > > >> >> > > > > >> > that the suggested pattern
> (i.e.
> > > > user
> > > > > > can
> > > > > > > > > always
> > > > > > > > > > > > call
> > > > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are
> > > > indeed
> > > > > > > > > supported.
> > > > > > > > > > > > E.g. if
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > > > >> >> > > > > >> > aborted by the broker side,
> then
> > > > users
> > > > > > can
> > > > > > > > > still
> > > > > > > > > > > > call
> > > > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > > > >> >> > > > > >> > not throw another exception
> but
> > > just
> > > > > be
> > > > > > > > > treated
> > > > > > > > > > > as a
> > > > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > 1)
> > > > *ConcurrentTransactionsException*:
> > > > > I
> > > > > > > > think
> > > > > > > > > > this
> > > > > > > > > > > > error
> > > > > > > > > > > > > > >> can
> > > > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > > > >> >> > > > > >> > returned but not documented
> yet.
> > > > This
> > > > > > > should
> > > > > > > > > be
> > > > > > > > > > a
> > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*:
> > > this
> > > > > > error
> > > > > > > is
> > > > > > > > > > > > returned
> > > > > > > > > > > > > > from
> > > > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > > > >> >> > > > > >> > state transition failed (e.g.
> it
> > > is
> > > > > > trying
> > > > > > > > to
> > > > > > > > > > > > transit to
> > > > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > > > > > > prepare-commit).
> > > > > > > > > > > This
> > > > > > > > > > > > > > error
> > > > > > > > > > > > > > >> >> could
> > > > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > > > >> >> > > > > >> > a bug on the client internal
> > code
> > > or
> > > > > the
> > > > > > > > > broker
> > > > > > > > > > > > code,
> > > > > > > > > > > > > > OR a
> > > > > > > > > > > > > > >> >> user
> > > > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > > > i.e.
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > > > >> >> > > > > >> > these exceptions would only be
> > > > > returned
> > > > > > if
> > > > > > > > > users
> > > > > > > > > > > > try to
> > > > > > > > > > > > > > do
> > > > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right
> > after
> > > a
> > > > > > > > commitTxn,
> > > > > > > > > > > etc.
> > > > > > > > > > > > So
> > > > > > > > > > > > > > I'm
> > > > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal error
> > > instead
> > > > > of a
> > > > > > > > fatal
> > > > > > > > > > > > error,
> > > > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i
> > > > "indicates
> > > > > > > fatal
> > > > > > > > > > > > > > transactional
> > > > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit
> > > conflicting
> > > > > with
> > > > > > > the
> > > > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I
> guess
> > > > your
> > > > > > > > proposal
> > > > > > > > > > is
> > > > > > > > > > > > that
> > > > > > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException
> > would
> > > be
> > > > > > > treated
> > > > > > > > > > > either
> > > > > > > > > > > > as
> > > > > > > > > > > > > > >> fatal
> > > > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > > > >> >> > > > > >> > transactional producer, or
> > > non-fatal
> > > > > > with
> > > > > > > > > > > idempotent
> > > > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > > >> >> > > > > >> > right? If the producer is only
> > > > > > configured
> > > > > > > > with
> > > > > > > > > > > > > > idempotency
> > > > > > > > > > > > > > >> >> but
> > > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be
> > > confusing
> > > > > > since
> > > > > > > > > users
> > > > > > > > > > > are
> > > > > > > > > > > > not
> > > > > > > > > > > > > > >> using
> > > > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we
> always
> > > > throw
> > > > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the
> > > > > producer
> > > > > > is
> > > > > > > > > > > > configured,
> > > > > > > > > > > > > > and
> > > > > > > > > > > > > > >> >> let
> > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > > > >> >> > > > > >> > decide how to handle it based
> on
> > > > > whether
> > > > > > > it
> > > > > > > > is
> > > > > > > > > > > only
> > > > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs,
> the
> > > > > > `send()`
> > > > > > > > > > > callback /
> > > > > > > > > > > > > > >> future
> > > > > > > > > > > > > > >> >> can
> > > > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I
> think
> > > this
> > > > > KIP
> > > > > > > > > should
> > > > > > > > > > > also
> > > > > > > > > > > > > > cover
> > > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2)
> > above:
> > > > > > > sometimes
> > > > > > > > > > those
> > > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or
> InvalidTxnState
> > > are
> > > > > not
> > > > > > > due
> > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > > > > state
> > > > > > > > > > > > > > >> >> being
> > > > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe
> > > users
> > > > > are
> > > > > > > > doing
> > > > > > > > > > > > something
> > > > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > > > >> >> > > > > >> > wondering if we should further
> > > > > > distinguish
> > > > > > > > > those
> > > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > > > >> >> > > > > >> > a) those that are caused by
> > Kafka
> > > > > > itself,
> > > > > > > > > e.g. a
> > > > > > > > > > > > broker
> > > > > > > > > > > > > > >> timed
> > > > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an
> > endTxn
> > > > > > request
> > > > > > > is
> > > > > > > > > > > > received,
> > > > > > > > > > > > > > >> and b)
> > > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect,
> > > > causing
> > > > > > the
> > > > > > > > > > request
> > > > > > > > > > > > to be
> > > > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > > > >> >> > > > > >> > me more like for case a), but
> > not
> > > > case
> > > > > > b).
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM
> > > > Boyang
> > > > > > > Chen <
> > > > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your
> > > > > proposal
> > > > > > > > makes
> > > > > > > > > > > sense
> > > > > > > > > > > > as
> > > > > > > > > > > > > > >> well,
> > > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > > > >> >> > > > > >> >> we could have a universally
> > > agreed
> > > > > > > > exception
> > > > > > > > > > type
> > > > > > > > > > > > to be
> > > > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > > > >> >> > > > > >> >> without having an extra layer
> > on
> > > > top
> > > > > of
> > > > > > > the
> > > > > > > > > > > actual
> > > > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy
> will
> > > be
> > > > > more
> > > > > > > > > > complex.
> > > > > > > > > > > > Right
> > > > > > > > > > > > > > >> now
> > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > > > >> >> > > > > >> >> RetriableException type, if
> we
> > > are
> > > > > > going
> > > > > > > to
> > > > > > > > > > add a
> > > > > > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException`
> > > type,
> > > > we
> > > > > > > have
> > > > > > > > to
> > > > > > > > > > put
> > > > > > > > > > > > this
> > > > > > > > > > > > > > >> new
> > > > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > > > >> >> > > > > >> >> parent of the
> > RetriableException,
> > > > > > because
> > > > > > > > not
> > > > > > > > > > all
> > > > > > > > > > > > > > thrown
> > > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in
> > > > general
> > > > > > for
> > > > > > > > > > > > producer, for
> > > > > > > > > > > > > > >> >> example
> > > > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > > > >> >> > > > > >> >> We could have an empty
> > interface
> > > > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions
> implement
> > > for
> > > > > > sure,
> > > > > > > > but
> > > > > > > > > > > it's
> > > > > > > > > > > > a
> > > > > > > > > > > > > > bit
> > > > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions
> in
> > > > > general.
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we
> > > throw a
> > > > > > > > > > > KafkaException
> > > > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > > > >> >> > > > > >> >> KafkaException as either
> fatal
> > or
> > > > > > > > non-fatal.
> > > > > > > > > If
> > > > > > > > > > > we
> > > > > > > > > > > > use
> > > > > > > > > > > > > > an
> > > > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required
> > to
> > > > > > > implement
> > > > > > > > > > > another
> > > > > > > > > > > > > > >> bloated
> > > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > >> >> > > > > >> >> which could replace
> > > KafkaException
> > > > > > type,
> > > > > > > as
> > > > > > > > > we
> > > > > > > > > > > > couldn't
> > > > > > > > > > > > > > >> mark
> > > > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the
> > encapsulation,
> > > > > > > wrapping
> > > > > > > > > > means
> > > > > > > > > > > we
> > > > > > > > > > > > > > could
> > > > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > > > >> >> > > > > >> >> of affection to the producer
> > > only,
> > > > > > which
> > > > > > > is
> > > > > > > > > > > > important
> > > > > > > > > > > > > > >> since
> > > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > > > >> >> > > > > >> >> shared exception types to
> > > > implement a
> > > > > > > > > > > > producer-related
> > > > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > > > >> >> > > > > >> >> as
> > > > UnknownTopicOrPartitionException.
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38
> PM
> > > > > Matthias
> > > > > > > J.
> > > > > > > > > Sax
> > > > > > > > > > <
> > > > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing
> > exceptions
> > > > > > makes a
> > > > > > > > lot
> > > > > > > > > > of
> > > > > > > > > > > > sense.
> > > > > > > > > > > > > > >> As I
> > > > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > > > >> >> > > > > >> >>> the producer internals well
> > > > enough,
> > > > > I
> > > > > > > > cannot
> > > > > > > > > > > > comment
> > > > > > > > > > > > > > on
> > > > > > > > > > > > > > >> the
> > > > > > > > > > > > > > >> >> > > > > >> >>> categorization in detail
> > though.
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if
> we
> > > > should
> > > > > > > > > introduce
> > > > > > > > > > > an
> > > > > > > > > > > > > > >> exception
> > > > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception
> would
> > > > > > implement
> > > > > > > > > > instead
> > > > > > > > > > > > of
> > > > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal
> > > > exceptions?
> > > > > > > What
> > > > > > > > > > would
> > > > > > > > > > > > be the
> > > > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang
> > Chen
> > > > > > wrote:
> > > > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a
> > > > discussion
> > > > > > > thread
> > > > > > > > > for
> > > > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to
> simplify
> > > the
> > > > > > > > exception
> > > > > > > > > > > > handling
> > > > > > > > > > > > > > >> logic
> > > > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer
> users
> > by
> > > > > > > > classifying
> > > > > > > > > > > > fatal and
> > > > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > > > >> >> > > > > >> >>>> and throw them
> > correspondingly
> > > > for
> > > > > > > easier
> > > > > > > > > > catch
> > > > > > > > > > > > and
> > > > > > > > > > > > > > >> retry.
> > > > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >> > --
> > > > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > > > >> >
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> --
> > > > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > -- Guozhang
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Boyang,

Ok, if we are going to ask all users to update their code anyway, maybe we
can get more bang out of this.

One of the remaining problems is that a user will only get the richer
information if they are supplying a Callback. If they wait on the future,
then they will get the usual exception. Our hands are somewhat tied here
because of compatibility. One thing we have wanted to do for a long time is
introduce a new send method which returns
`CompletableFuture<RecordMetadata>` instead. If we did so, then we wouldn't
need to change `Callback` at all. Instead we could make the contract of the
new api that it will always return `SendFailedException`, which could
provide a method to get the failure type.

I think Chia-Ping has been thinking about this:
https://issues.apache.org/jira/browse/KAFKA-12227. Is there a way to
consolidate our efforts in order to avoid unnecessary noise for users? It
would be awesome to see a more modern API which also addresses the
confusing error handling.

-Jason







On Fri, Jan 29, 2021 at 3:48 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks Jason, my hope is that we could streamline the callback function
> signature, so that in the next major release (3.0),
> onCompletion(RecordMetadata metadata, SendFailure sendFailure) will become
> the only API to be implemented.
>
> On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Boyang,
> >
> > Do you think it's necessary to deprecate the old `onCompletion`
> callback? I
> > was thinking there's probably no harm leaving it around. Users might not
> > care about the failure type. Other than that, it looks good to me.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > the CommitFailedException wrapping would cover all the non-fatal
> > exceptions
> > > as we listed out in the KIP, generally speaking any exception that
> could
> > > recover safely by calling abortTxn should be wrapped.
> > >
> > > Best,
> > > Boyang
> > >
> > > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hey Boyang,
> > > >
> > > > I think maybe there's something cracking here :) I'm just asking for
> > > > clarifications that as of today, which non-fatal exceptions the newly
> > > > introduced CommitFailedException would cover, and it seems to be only
> > 1)
> > > > unknown pid, 2) invalid pid mapping, and 3) concurrent transactions.
> Is
> > > > that correct?
> > > >
> > > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Hey Guozhang, I think TimeoutException would not be covered here as
> > it
> > > > > potentially has a risk of hitting an illegal state on the broker
> side
> > > > when
> > > > > the previous commit was actually successful. Users should try to
> > > increase
> > > > > their max.block.ms to avoid hitting the timeout as a base
> > suggestion,
> > > > > which
> > > > > is discussed in the KIP.
> > > > >
> > > > > As for the umbrella exception, I agree it has values to some
> extent,
> > > but
> > > > in
> > > > > terms of adoption difficulty, additional exception types usually
> make
> > > the
> > > > > error handling more complex than simplified, and we are doing our
> > best
> > > to
> > > > > avoid compatibility issues caused by wrapping previously thrown raw
> > > > > exceptions that invalidates user error handling. And I'm also not
> in
> > > > favor
> > > > > of our exception hierarchy today where all exceptions are
> subclasses
> > of
> > > > > KafkaException, and we already mixed its handling with other
> specific
> > > > > exception types that could be caught as superclass KafkaException.
> > That
> > > > > could be addressed in a separate proposal if we see necessary.
> > > > >
> > > > > Boyang
> > > > >
> > > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Thanks Jason for the suggestion, that looks good to me too.
> > > > > >
> > > > > > Regarding the non-fatal exceptions wrapped as CommitFailed, I
> would
> > > > like
> > > > > to
> > > > > > clarify if we would cover all the following cases: 1) timeout, 2)
> > > > unknown
> > > > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > > > >
> > > > > > BTW I think it still makes sense to use an umbrella exception in
> > case
> > > > in
> > > > > > the future we add more non-fatal cases even though today we only
> > > have a
> > > > > > few.
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Jason, I agree with the proposed solution here, will
> > update
> > > > the
> > > > > > KIP.
> > > > > > >
> > > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > > jason@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Boyang,
> > > > > > > >
> > > > > > > > It seems like a reasonable suggestion. I wonder if a flag is
> > > > > sufficient
> > > > > > > > though. The current `Callback` documentation treats "fatal"
> > > errors
> > > > > from
> > > > > > > the
> > > > > > > > perspective of the individual message that was sent.
> > > > > > > >
> > > > > > > > ```
> > > > > > > >     *                  Non-Retriable exceptions (fatal, the
> > > message
> > > > > > will
> > > > > > > > never be sent):
> > > > > > > > ```
> > > > > > > >
> > > > > > > > However, we also have fatal errors from the perspective of
> the
> > > > > > > transaction
> > > > > > > > (e.g. when the producer gets fenced). Perhaps that suggests
> we
> > > need
> > > > > > > > something richer than a boolean flag:
> > > > > > > >
> > > > > > > > At a high level, I think the following cases are possible:
> > > > > > > >
> > > > > > > > - message rejected (e.g. message too large, invalid topic)
> > > > > > > > - delivery failed after retries/delivery timeout (e.g.
> timeout,
> > > crc
> > > > > > > error,
> > > > > > > > not enough replicas)
> > > > > > > > - transaction failed (e.g. producer fenced, invalid
> transaction
> > > > > state)
> > > > > > > >
> > > > > > > > Perhaps instead we can have a type like the following:
> > > > > > > >
> > > > > > > > class SendFailure {
> > > > > > > >   FailureType failureType;
> > > > > > > >   Exception cause;
> > > > > > > > }
> > > > > > > >
> > > > > > > > enum FailureType {
> > > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > > > > > }
> > > > > > > >
> > > > > > > > (Not married to any of these names, just a starting point.)
> > > > > > > >
> > > > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > > > >
> > > > > > > > default void onCompletion(RecordMetadata metadata,
> SendFailure
> > > > > > failure) {
> > > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > > }
> > > > > > > >
> > > > > > > > This would give streams and other applications enough
> > information
> > > > to
> > > > > > know
> > > > > > > > whether the message can be retried and whether the
> transaction
> > > can
> > > > be
> > > > > > > > aborted.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > -Jason
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > > reluctanthero104@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks Jason for the thoughts.
> > > > > > > > >
> > > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > > > jason@confluent.io
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Boyang,
> > > > > > > > > >
> > > > > > > > > > Thanks for the iterations here. I think this is something
> > we
> > > > > should
> > > > > > > > have
> > > > > > > > > > done a long time ago. It sounds like there are two API
> > > changes
> > > > > > here:
> > > > > > > > > >
> > > > > > > > > > 1. We are introducing the `CommitFailedException` to wrap
> > > > > abortable
> > > > > > > > > > errors that are raised from `commitTransaction`. This
> > sounds
> > > > fine
> > > > > > to
> > > > > > > > me.
> > > > > > > > > As
> > > > > > > > > > far as I know, the only case we might need this is when
> we
> > > add
> > > > > > > support
> > > > > > > > to
> > > > > > > > > > let producers recover from coordinator timeouts. Are
> there
> > > any
> > > > > > > others?
> > > > > > > > > >
> > > > > > > > > > I think the purpose here is to ensure non-fatal
> exceptions
> > > are
> > > > > > > unified
> > > > > > > > > under the same
> > > > > > > > > exception umbrella, to make the proceeding to abort any
> > ongoing
> > > > > > > > transaction
> > > > > > > > > much easier.
> > > > > > > > > I don't think `coordinator timeouts` is the only case to
> > > recover
> > > > > > here,
> > > > > > > > > since we have other
> > > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > > >
> > > > > > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > > > > > `KafkaException`.
> > > > > > > > > > The motivation for this is less clear to me and it
> doesn't
> > > look
> > > > > > like
> > > > > > > > the
> > > > > > > > > > example from the KIP depends on it. My concern here is
> > > > > > compatibility.
> > > > > > > > > > Currently we have the following documentation for the
> > > > `Callback`
> > > > > > API:
> > > > > > > > > >
> > > > > > > > > > ```
> > > > > > > > > >      *                  Non-Retriable exceptions (fatal,
> > the
> > > > > > message
> > > > > > > > will
> > > > > > > > > > never be sent):
> > > > > > > > > >      *
> > > > > > > > > >      *                  InvalidTopicException
> > > > > > > > > >      *                  OffsetMetadataTooLargeException
> > > > > > > > > >      *                  RecordBatchTooLargeException
> > > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > > >      *                  UnknownServerException
> > > > > > > > > >      *                  UnknownProducerIdException
> > > > > > > > > >      *                  InvalidProducerEpochException
> > > > > > > > > >      *
> > > > > > > > > >      *                  Retriable exceptions (transient,
> > may
> > > be
> > > > > > > covered
> > > > > > > > > by
> > > > > > > > > > increasing #.retries):
> > > > > > > > > >      *
> > > > > > > > > >      *                  CorruptRecordException
> > > > > > > > > >      *                  InvalidMetadataException
> > > > > > > > > >      *
> > NotEnoughReplicasAfterAppendException
> > > > > > > > > >      *                  NotEnoughReplicasException
> > > > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > > > >      *                  TimeoutException
> > > > > > > > > >      *                  UnknownTopicOrPartitionException
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > If we wrap all the retriable exceptions documented here
> as
> > > > > > > > > > `KafkaException`, wouldn't that break any error handling
> > that
> > > > > users
> > > > > > > > might
> > > > > > > > > > already have? it's gonna introduce a compatibility issue.
> > > > > > > > > >
> > > > > > > > > > The original intention was to simplify `send` callback
> > error
> > > > > > handling
> > > > > > > > by
> > > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > > we have to prepare an exhausting list of exceptions to
> catch
> > as
> > > > > > fatal,
> > > > > > > > and
> > > > > > > > > the same lengthy list to catch as
> > > > > > > > > non-fatal. It would be much easier if we got `hints` from
> the
> > > > > > callback.
> > > > > > > > > However,
> > > > > > > > > I agree there is a compatibility concern, what about
> > > deprecating
> > > > > the
> > > > > > > > > existing:
> > > > > > > > >
> > > > > > > > > void onCompletion(RecordMetadata metadata, Exception
> > exception)
> > > > > > > > >
> > > > > > > > > and replace it with:
> > > > > > > > >
> > > > > > > > > default void onCompletion(RecordMetadata metadata,
> Exception
> > > > > > exception,
> > > > > > > > > boolean isFatal) {
> > > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > to make sure new users get the benefit of understanding the
> > > > > fatality
> > > > > > > > based
> > > > > > > > > on the info presented by the producer?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > > Jason
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > > info@hiringuru.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Why  we are receiving all emails kindly remove us from
> > > > > > > > > > > dev@kafka.apache.org we don't want to receive emails
> > > > anymore.
> > > > > > > > > > >
> > > > > > > > > > > Thanks
> > > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> > wangguoz@gmail.com>
> > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks Boyang, yes I think I was confused about the
> > > > different
> > > > > > > > > handling
> > > > > > > > > > of
> > > > > > > > > > > > two abortTxn calls, and now I get it was not
> > > intentional. I
> > > > > > > think I
> > > > > > > > > do
> > > > > > > > > > > not
> > > > > > > > > > > > have more concerns.
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the clarification Guozhang, I got your
> > point
> > > > > that
> > > > > > we
> > > > > > > > > want
> > > > > > > > > > to
> > > > > > > > > > > > > have a consistent handling of fatal exceptions
> being
> > > > thrown
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > > > abortTxn. I modified the current template to move
> the
> > > > fatal
> > > > > > > > > exception
> > > > > > > > > > > > > try-catch outside of the processing loop to make
> sure
> > > we
> > > > > > could
> > > > > > > > get
> > > > > > > > > a
> > > > > > > > > > > chance
> > > > > > > > > > > > > to close consumer/producer modules. Let me know
> what
> > > you
> > > > > > think.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Best,
> > > > > > > > > > > > > Boyang
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > My understanding is that abortTransaction would
> > only
> > > > > throw
> > > > > > > when
> > > > > > > > > the
> > > > > > > > > > > > > > producer is in fatal state. For CommitFailed, the
> > > > > producer
> > > > > > > > should
> > > > > > > > > > > still
> > > > > > > > > > > > > be
> > > > > > > > > > > > > > in the abortable error state, so that
> > > abortTransaction
> > > > > call
> > > > > > > > would
> > > > > > > > > > not
> > > > > > > > > > > > > throw.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > > > > > > > wangguoz@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >> Or are you going to maintain some internal state
> > > such
> > > > > > that,
> > > > > > > > the
> > > > > > > > > > > > > >> `abortTransaction` in the catch block would
> never
> > > > throw
> > > > > > > again?
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > I've also thought about this (i.e. using
> > > > CommitFailed
> > > > > > for
> > > > > > > > all
> > > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > > > > > (CommitFailed)
> > > > > > > > > > block,
> > > > > > > > > > > > > what
> > > > > > > > > > > > > >> > would happen if the
> > `producer.abortTransaction();`
> > > > > > throws
> > > > > > > > > again?
> > > > > > > > > > > > > should
> > > > > > > > > > > > > >> > that be captured as a fatal and cause the
> client
> > > to
> > > > > > close
> > > > > > > > > again.
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > > >> >         // Transaction commit failed with
> > > abortable
> > > > > > error,
> > > > > > > > > user
> > > > > > > > > > > could
> > > > > > > > > > > > > >> reset
> > > > > > > > > > > > > >> >         // the application state and resume
> > with a
> > > > new
> > > > > > > > > > > transaction.
> > > > > > > > > > > > > The
> > > > > > > > > > > > > >> > root
> > > > > > > > > > > > > >> >         // cause was wrapped in the thrown
> > > > exception.
> > > > > > > > > > > > > >> >
>  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > > >> >         }
> > > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > > >> >     }
> > > > > > > > > > > > > >> > ...
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> Jason and I were discussing the new API
> offline
> > > and
> > > > > > > decided
> > > > > > > > > to
> > > > > > > > > > > take
> > > > > > > > > > > > > >> >> another
> > > > > > > > > > > > > >> >> approach. Firstly, the reason not to invent a
> > new
> > > > API
> > > > > > > with
> > > > > > > > > > > returned
> > > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > > >> >> flag is for compatibility consideration,
> since
> > > old
> > > > > EOS
> > > > > > > code
> > > > > > > > > > > would not
> > > > > > > > > > > > > >> know
> > > > > > > > > > > > > >> >> that a given transaction commit was failed
> > > > internally
> > > > > > as
> > > > > > > > they
> > > > > > > > > > > don't
> > > > > > > > > > > > > >> listen
> > > > > > > > > > > > > >> >> to the output flag. Our proposed alternative
> > > > solution
> > > > > > is
> > > > > > > to
> > > > > > > > > let
> > > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > > >> >> throw CommitFailedException whenever the
> commit
> > > > > failed
> > > > > > > with
> > > > > > > > > > > non-fatal
> > > > > > > > > > > > > >> >> exception*, so that on the caller side the
> > > handling
> > > > > > logic
> > > > > > > > > > > becomes:
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > > >> >>
> > > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > > > > >> >>         // Transaction commit failed with
> > > abortable
> > > > > > > error,
> > > > > > > > > user
> > > > > > > > > > > could
> > > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > > >> >>         // the application state and resume
> > with
> > > a
> > > > > new
> > > > > > > > > > > transaction.
> > > > > > > > > > > > > The
> > > > > > > > > > > > > >> >> root
> > > > > > > > > > > > > >> >>         // cause was wrapped in the thrown
> > > > exception.
> > > > > > > > > > > > > >> >>
> >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> This approach looks cleaner as all exception
> > > types
> > > > > > other
> > > > > > > > than
> > > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > > >> >> will doom to be fatal, which is very easy to
> > > adopt
> > > > > for
> > > > > > > > users.
> > > > > > > > > > In
> > > > > > > > > > > the
> > > > > > > > > > > > > >> >> meantime, we still maintain the commitTxn
> > > behavior
> > > > to
> > > > > > > throw
> > > > > > > > > > > instead
> > > > > > > > > > > > > of
> > > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> In addition, we decided to drop the
> > > recommendation
> > > > to
> > > > > > > > handle
> > > > > > > > > > > > > >> >> TimeoutException and leave it to the users to
> > > make
> > > > > the
> > > > > > > > call.
> > > > > > > > > > The
> > > > > > > > > > > > > >> downside
> > > > > > > > > > > > > >> >> for blindly calling abortTxn upon timeout is
> > that
> > > > we
> > > > > > > could
> > > > > > > > > > > result in
> > > > > > > > > > > > > an
> > > > > > > > > > > > > >> >> illegal state when the commit was already
> > > > successful
> > > > > on
> > > > > > > the
> > > > > > > > > > > broker
> > > > > > > > > > > > > >> >> side. Without a good guarantee on the
> outcome,
> > > > > > > > > overcomplicating
> > > > > > > > > > > the
> > > > > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> Let me know your thoughts on the new approach
> > > here,
> > > > > > thank
> > > > > > > > > you!
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang
> Wang
> > <
> > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >> > Thanks for your clarification on 2)/3),
> that
> > > > makes
> > > > > > > sense.
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang
> Chen
> > <
> > > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied
> > > inline.
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang
> > > Wang <
> > > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it
> > again
> > > > and
> > > > > > > have
> > > > > > > > > the
> > > > > > > > > > > > > >> following
> > > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if
> commitTxn
> > > does
> > > > > not
> > > > > > > > throw
> > > > > > > > > > any
> > > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > > >> >> > > > exception, and instead we rely on the
> > > > > subsequent
> > > > > > > > > beginTxn
> > > > > > > > > > > call
> > > > > > > > > > > > > to
> > > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > > >> >> > > > may violate some callers with a pattern
> > > that
> > > > > > > relying
> > > > > > > > on
> > > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > > >> >> > > > succeed to make some non-rollback
> > > operations.
> > > > > For
> > > > > > > > > > example:
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit
> the
> > > DB
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > > > > > non-rollback
> > > > > > > > > > > operation,
> > > > > > > > > > > > > and
> > > > > > > > > > > > > >> >> users
> > > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > > >> >> > > > just rely on commitTxn to return
> without
> > > > error
> > > > > to
> > > > > > > > make
> > > > > > > > > > this
> > > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > > >> >> > > > call. Of course we can just claim this
> > > > pattern
> > > > > is
> > > > > > > not
> > > > > > > > > > > > > legitimate
> > > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > > >> >> > > > the right way of doing things, but many
> > > users
> > > > > may
> > > > > > > > > > naturally
> > > > > > > > > > > > > adopt
> > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > So maybe we should still let commitTxn
> > also
> > > > > throw
> > > > > > > > > > non-fatal
> > > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > > >> >> > > > which case we would then call abortTxn
> > > again.
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > > >> >> > > >     } catch (...) {        // enumerate
> > all
> > > > > fatal
> > > > > > > > > > > exceptions
> > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > Thank makes me think, the alternative
> we
> > > have
> > > > > > > > discussed
> > > > > > > > > > > offline
> > > > > > > > > > > > > >> may
> > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > >> >> > > > better: let commitTxn() to return a
> > boolean
> > > > > flag.
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > * If it returns true, it means the
> commit
> > > > > > > succeeded.
> > > > > > > > > > Users
> > > > > > > > > > > can
> > > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > > >> >> > > > continue and do any external
> non-rollback
> > > > > > > operations
> > > > > > > > if
> > > > > > > > > > > they
> > > > > > > > > > > > > >> like.
> > > > > > > > > > > > > >> >> > > > * If it returns false, it means the
> > commit
> > > > > failed
> > > > > > > > with
> > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > >> >> error
> > > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > > >> >> > > > the txn has been aborted*. Users do not
> > > need
> > > > to
> > > > > > > call
> > > > > > > > > > > abortTxn
> > > > > > > > > > > > > >> again.
> > > > > > > > > > > > > >> >> > > > * If it throws, then it means fatal
> > errors.
> > > > > Users
> > > > > > > > > should
> > > > > > > > > > > shut
> > > > > > > > > > > > > >> down
> > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > > >> >> > > >         // reset offsets, rollback
> > > > operations,
> > > > > > etc
> > > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > > >> >> > > >    // reset offsets, rollback
> operations,
> > > etc
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > Of course, if we want to have better
> > > > visibility
> > > > > > > into
> > > > > > > > > what
> > > > > > > > > > > > > caused
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > > >> >> > > > to fail and txn to abort. We can let
> the
> > > > return
> > > > > > > type
> > > > > > > > be
> > > > > > > > > > an
> > > > > > > > > > > enum
> > > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > >> >> > > > a flag. But my main idea is to still
> let
> > > the
> > > > > > > > commitTxn
> > > > > > > > > be
> > > > > > > > > > > the
> > > > > > > > > > > > > >> final
> > > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > > >> >> > > > users can tell whether this txn
> succeeded
> > > or
> > > > > not,
> > > > > > > > > instead
> > > > > > > > > > > of
> > > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > I agree that adding a boolean flag is
> > > indeed
> > > > > > useful
> > > > > > > > in
> > > > > > > > > > this
> > > > > > > > > > > > > case.
> > > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to
> > > throw
> > > > > > fatal
> > > > > > > > > > > exception
> > > > > > > > > > > > > in
> > > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > >> >> > > > sure what do you mean by "throw" here.
> > Are
> > > > you
> > > > > > > > > proposing
> > > > > > > > > > > the
> > > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal
> > exceptions
> > > > when
> > > > > > > > > triggered
> > > > > > > > > > > > > without
> > > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit
> is
> > > to
> > > > > make
> > > > > > > the
> > > > > > > > > end
> > > > > > > > > > > > > user's
> > > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the claim
> > > > > regarding
> > > > > > > the
> > > > > > > > > > > callback
> > > > > > > > > > > > > >> that
> > > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > > >> >> > > > setup, it is not required to handle the
> > > > > > exception".
> > > > > > > > Are
> > > > > > > > > > you
> > > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to
> handle
> > > any
> > > > > > > > exceptions
> > > > > > > > > > if
> > > > > > > > > > > EOS
> > > > > > > > > > > > > is
> > > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > > >> >> > > > in the callback anymore, but just let
> > > > > commitTxn()
> > > > > > > > > itself
> > > > > > > > > > to
> > > > > > > > > > > > > fail
> > > > > > > > > > > > > >> to
> > > > > > > > > > > > > >> >> be
> > > > > > > > > > > > > >> >> > > > notified about the problem? Personally
> I
> > > > think
> > > > > in
> > > > > > > > > Streams
> > > > > > > > > > > we
> > > > > > > > > > > > > >> should
> > > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > > >> >> > > > make the handling logic of the callback
> > to
> > > be
> > > > > > > > > consistent
> > > > > > > > > > > > > >> regardless
> > > > > > > > > > > > > >> >> of
> > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > >> >> > > > EOS settings (today we have different
> > logic
> > > > > > > depending
> > > > > > > > > on
> > > > > > > > > > > this
> > > > > > > > > > > > > >> logic,
> > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > My idea originates from the claim on
> send
> > > > API:
> > > > > > > > > > > > > >> >> > > "When used as part of a transaction, it
> is
> > > not
> > > > > > > > necessary
> > > > > > > > > to
> > > > > > > > > > > > > define
> > > > > > > > > > > > > >> a
> > > > > > > > > > > > > >> >> > > callback or check the result of the
> future
> > > in
> > > > > > order
> > > > > > > to
> > > > > > > > > > > detect
> > > > > > > > > > > > > >> errors
> > > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > > >> >> > > My understanding is that for EOS, the
> > > exception
> > > > > > will
> > > > > > > be
> > > > > > > > > > > detected
> > > > > > > > > > > > > by
> > > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > > >> >> > > transactional APIs either way, so it is a
> > > > > duplicate
> > > > > > > > > > handling
> > > > > > > > > > > to
> > > > > > > > > > > > > >> track
> > > > > > > > > > > > > >> >> > > all the sendExceptions in
> RecordCollector.
> > > > > > However, I
> > > > > > > > > > looked
> > > > > > > > > > > up
> > > > > > > > > > > > > >> >> > > sendException is being used today as
> > follow:
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler"
> for
> > > any
> > > > > > > default
> > > > > > > > > or
> > > > > > > > > > > > > >> customized
> > > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > > >> >> > > 2. Stop collecting offset info or new
> > > > exceptions
> > > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in
> close(),
> > > > > flush()
> > > > > > > or
> > > > > > > > > new
> > > > > > > > > > > send()
> > > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > To my understanding, we should still
> honor
> > > the
> > > > > > > > commitment
> > > > > > > > > > to
> > > > > > > > > > > #1
> > > > > > > > > > > > > for
> > > > > > > > > > > > > >> >> any
> > > > > > > > > > > > > >> >> > > user customized implementation. The #2
> does
> > > not
> > > > > > > really
> > > > > > > > > > > affect our
> > > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable
> as
> > it
> > > > > could
> > > > > > > > help
> > > > > > > > > us
> > > > > > > > > > > fail
> > > > > > > > > > > > > >> fast
> > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > >> >> > > new send() instead of waiting to later
> > stage
> > > of
> > > > > > > > > processing.
> > > > > > > > > > > In
> > > > > > > > > > > > > that
> > > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > > >> >> > > I agree we should continue to record send
> > > > > > exceptions
> > > > > > > > even
> > > > > > > > > > > under
> > > > > > > > > > > > > EOS
> > > > > > > > > > > > > >> >> case
> > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > >> >> > > ensure the strength of stream side
> Producer
> > > > > logic.
> > > > > > On
> > > > > > > > the
> > > > > > > > > > > safer
> > > > > > > > > > > > > >> side,
> > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > > >> >> > > longer need to wrap certain fatal
> > exceptions
> > > > like
> > > > > > > > > > > ProducerFenced
> > > > > > > > > > > > > as
> > > > > > > > > > > > > >> >> > > TaskMigrated, since they should not crash
> > the
> > > > > > stream
> > > > > > > > > thread
> > > > > > > > > > > if
> > > > > > > > > > > > > >> thrown
> > > > > > > > > > > > > >> >> in
> > > > > > > > > > > > > >> >> > > raw format, once we adopt the new
> > processing
> > > > > model
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > send
> > > > > > > > > > > > > >> phase.
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang
> > > Chen <
> > > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so
> far.
> > I
> > > > have
> > > > > > > > > polished
> > > > > > > > > > > the
> > > > > > > > > > > > > KIP
> > > > > > > > > > > > > >> >> after
> > > > > > > > > > > > > >> >> > > > > offline discussion with some folks
> > > working
> > > > on
> > > > > > EOS
> > > > > > > > to
> > > > > > > > > > > make the
> > > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > > >> >> > > > > handling more lightweight. The
> > essential
> > > > > change
> > > > > > > is
> > > > > > > > > that
> > > > > > > > > > > we
> > > > > > > > > > > > > are
> > > > > > > > > > > > > >> not
> > > > > > > > > > > > > >> >> > > > > inventing a new intermediate
> exception
> > > > type,
> > > > > > but
> > > > > > > > > > instead
> > > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > > > This way for any exception thrown
> from
> > > > phase
> > > > > 1,
> > > > > > > > will
> > > > > > > > > be
> > > > > > > > > > > an
> > > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > > >> >> > > > > phase 2 whether we should do commit
> or
> > > > abort,
> > > > > > and
> > > > > > > > > from
> > > > > > > > > > > now on
> > > > > > > > > > > > > >> >> > > > > `commitTransaction` should only throw
> > > fatal
> > > > > > > > > exceptions,
> > > > > > > > > > > > > >> similar to
> > > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > > > > KafkaException
> > > > > > > > caught
> > > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > > >> >> > > > > will be definitely fatal to crash the
> > > app.
> > > > > For
> > > > > > > more
> > > > > > > > > > > advanced
> > > > > > > > > > > > > >> users
> > > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > >> >> > > > > Streams, we have the ability to
> further
> > > > wrap
> > > > > > > > selected
> > > > > > > > > > > types
> > > > > > > > > > > > > of
> > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > >> >> > > > > exceptions to trigger task migration
> if
> > > > > > > necessary.
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > > > More details in the KIP, feel free to
> > > take
> > > > > > > another
> > > > > > > > > > look,
> > > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM
> Boyang
> > > > Chen <
> > > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM
> Bruno
> > > > > Cadonna
> > > > > > <
> > > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not know
> > the
> > > > > > producer
> > > > > > > > > > > internal
> > > > > > > > > > > > > well
> > > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > > >> >> > > > > >> comment on the categorization.
> > > However,
> > > > I
> > > > > > > think
> > > > > > > > > > > having a
> > > > > > > > > > > > > >> super
> > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that
> > > encodes
> > > > if
> > > > > > an
> > > > > > > > > > > exception is
> > > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > > >> >> > > > > >> is cleaner, better understandable
> > and
> > > > less
> > > > > > > > > > > error-prone,
> > > > > > > > > > > > > >> because
> > > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal
> > exception
> > > > in
> > > > > > > future
> > > > > > > > > you
> > > > > > > > > > > just
> > > > > > > > > > > > > >> need
> > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > >> >> > > > > >> about letting it inherit from the
> > > super
> > > > > > > > exception
> > > > > > > > > > and
> > > > > > > > > > > all
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > > >> >> > > > > >> the code will just behave
> correctly
> > > > > without
> > > > > > > the
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > > > wrap
> > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > > >> >> > > > > >> exception into another exception
> > each
> > > > time
> > > > > > it
> > > > > > > is
> > > > > > > > > > > thrown
> > > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > > >> >> it
> > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > >> >> > > > > >> thrown at different location in
> the
> > > > code).
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> As far as I understand the
> following
> > > > > > statement
> > > > > > > > > from
> > > > > > > > > > > your
> > > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > > >> >> > > > > >> is the reason that currently such
> a
> > > > super
> > > > > > > > > exception
> > > > > > > > > > > is not
> > > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> "Right now we have
> > RetriableException
> > > > > type,
> > > > > > if
> > > > > > > > we
> > > > > > > > > > are
> > > > > > > > > > > > > going
> > > > > > > > > > > > > >> to
> > > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException` type,
> > we
> > > > have
> > > > > > to
> > > > > > > > put
> > > > > > > > > > > this new
> > > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > >> >> > > > > >> the parent of the
> > RetriableException,
> > > > > > because
> > > > > > > > not
> > > > > > > > > > all
> > > > > > > > > > > > > thrown
> > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in
> > general
> > > > for
> > > > > > > > > producer"
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> In the list of exceptions in your
> > > KIP, I
> > > > > > found
> > > > > > > > > > > non-fatal
> > > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > > RetriableException.
> > > > I
> > > > > > > guess
> > > > > > > > > > those
> > > > > > > > > > > are
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > > >> >> > > > > >> are referring to in your
> statement:
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> All of those exceptions are
> > non-fatal
> > > > and
> > > > > do
> > > > > > > not
> > > > > > > > > > > inherit
> > > > > > > > > > > > > >> from
> > > > > > > > > > > > > >> >> > > > > >> RetriableException. Is there a
> > reason
> > > > for
> > > > > > > that?
> > > > > > > > If
> > > > > > > > > > > they
> > > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > > >> >> > > > > >> RetriableException we would be a
> bit
> > > > > closer
> > > > > > > to a
> > > > > > > > > > super
> > > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> The reason is that sender may
> catch
> > > > those
> > > > > > > > > exceptions
> > > > > > > > > > > in
> > > > > > > > > > > > > the
> > > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently
> > does
> > > > > > infinite
> > > > > > > > > > > > > >> >> > > > > > retries on RetriableException. To
> > make
> > > > sure
> > > > > > we
> > > > > > > > > could
> > > > > > > > > > > > > trigger
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> > > > > > abortTransaction() by catching
> > > non-fatal
> > > > > > thrown
> > > > > > > > > > > > > >> >> > > > > > exceptions and reinitialize the txn
> > > > state,
> > > > > we
> > > > > > > > chose
> > > > > > > > > > > not to
> > > > > > > > > > > > > >> let
> > > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > > >> >> > > > > > exceptions inherit
> RetriableException
> > > so
> > > > > that
> > > > > > > > > > > > > >> >> > > > > > they won't cause infinite retry on
> > > > sender.
> > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException
> and
> > > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > > >> >> > > > > >> to understand that their fatality
> > > > depends
> > > > > on
> > > > > > > the
> > > > > > > > > > type
> > > > > > > > > > > > > (i.e.
> > > > > > > > > > > > > >> >> > > > > >> configuration) of the producer.
> That
> > > > makes
> > > > > > it
> > > > > > > > > > > difficult to
> > > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > > >> >> > > > > >> exception that encodes the
> > > retriability
> > > > as
> > > > > > > > > mentioned
> > > > > > > > > > > > > above.
> > > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > > >> >> > > > > >> possible to introduce exceptions
> > that
> > > > > > inherit
> > > > > > > > from
> > > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > > >> >> > > > > >> and that are thrown when those
> > > > exceptions
> > > > > > are
> > > > > > > > > caught
> > > > > > > > > > > from
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > > >> >> > > > > >> and the type of the producer is
> such
> > > > that
> > > > > > the
> > > > > > > > > > > exceptions
> > > > > > > > > > > > > are
> > > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> Yea, I think in general the
> > exception
> > > > type
> > > > > > > > mixing
> > > > > > > > > is
> > > > > > > > > > > > > >> difficult
> > > > > > > > > > > > > >> >> to
> > > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > > >> >> > > > > > them all right. I have already
> > proposed
> > > > > > another
> > > > > > > > > > > solution
> > > > > > > > > > > > > >> based
> > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > > >> >> > > > > > offline discussion with some folks
> > > > working
> > > > > on
> > > > > > > EOS
> > > > > > > > > > > > > >> >> > > > > > to make the handling more
> > > straightforward
> > > > > for
> > > > > > > end
> > > > > > > > > > users
> > > > > > > > > > > > > >> without
> > > > > > > > > > > > > >> >> the
> > > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang
> > > wrote:
> > > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal!
> I
> > > > made a
> > > > > > > pass
> > > > > > > > > over
> > > > > > > > > > > the
> > > > > > > > > > > > > >> list
> > > > > > > > > > > > > >> >> and
> > > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not part of
> > the
> > > > > public
> > > > > > > > API,
> > > > > > > > > I
> > > > > > > > > > > think
> > > > > > > > > > > > > we
> > > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e.
> > > user
> > > > > can
> > > > > > > > always
> > > > > > > > > > > call
> > > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are
> > > indeed
> > > > > > > > supported.
> > > > > > > > > > > E.g. if
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > > >> >> > > > > >> > aborted by the broker side, then
> > > users
> > > > > can
> > > > > > > > still
> > > > > > > > > > > call
> > > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > > >> >> > > > > >> > not throw another exception but
> > just
> > > > be
> > > > > > > > treated
> > > > > > > > > > as a
> > > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > 1)
> > > *ConcurrentTransactionsException*:
> > > > I
> > > > > > > think
> > > > > > > > > this
> > > > > > > > > > > error
> > > > > > > > > > > > > >> can
> > > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > > >> >> > > > > >> > returned but not documented yet.
> > > This
> > > > > > should
> > > > > > > > be
> > > > > > > > > a
> > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*:
> > this
> > > > > error
> > > > > > is
> > > > > > > > > > > returned
> > > > > > > > > > > > > from
> > > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > > >> >> > > > > >> > state transition failed (e.g. it
> > is
> > > > > trying
> > > > > > > to
> > > > > > > > > > > transit to
> > > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > > > > > prepare-commit).
> > > > > > > > > > This
> > > > > > > > > > > > > error
> > > > > > > > > > > > > >> >> could
> > > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > > >> >> > > > > >> > a bug on the client internal
> code
> > or
> > > > the
> > > > > > > > broker
> > > > > > > > > > > code,
> > > > > > > > > > > > > OR a
> > > > > > > > > > > > > >> >> user
> > > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > > i.e.
> > > > > > > > > > > > > if
> > > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > > >> >> > > > > >> > these exceptions would only be
> > > > returned
> > > > > if
> > > > > > > > users
> > > > > > > > > > > try to
> > > > > > > > > > > > > do
> > > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right
> after
> > a
> > > > > > > commitTxn,
> > > > > > > > > > etc.
> > > > > > > > > > > So
> > > > > > > > > > > > > I'm
> > > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal error
> > instead
> > > > of a
> > > > > > > fatal
> > > > > > > > > > > error,
> > > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i
> > > "indicates
> > > > > > fatal
> > > > > > > > > > > > > transactional
> > > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit
> > conflicting
> > > > with
> > > > > > the
> > > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess
> > > your
> > > > > > > proposal
> > > > > > > > > is
> > > > > > > > > > > that
> > > > > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException
> would
> > be
> > > > > > treated
> > > > > > > > > > either
> > > > > > > > > > > as
> > > > > > > > > > > > > >> fatal
> > > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > > >> >> > > > > >> > transactional producer, or
> > non-fatal
> > > > > with
> > > > > > > > > > idempotent
> > > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > > >> >> > > > > >> > right? If the producer is only
> > > > > configured
> > > > > > > with
> > > > > > > > > > > > > idempotency
> > > > > > > > > > > > > >> >> but
> > > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be
> > confusing
> > > > > since
> > > > > > > > users
> > > > > > > > > > are
> > > > > > > > > > > not
> > > > > > > > > > > > > >> using
> > > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we always
> > > throw
> > > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the
> > > > producer
> > > > > is
> > > > > > > > > > > configured,
> > > > > > > > > > > > > and
> > > > > > > > > > > > > >> >> let
> > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > > >> >> > > > > >> > decide how to handle it based on
> > > > whether
> > > > > > it
> > > > > > > is
> > > > > > > > > > only
> > > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the
> > > > > `send()`
> > > > > > > > > > callback /
> > > > > > > > > > > > > >> future
> > > > > > > > > > > > > >> >> can
> > > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I think
> > this
> > > > KIP
> > > > > > > > should
> > > > > > > > > > also
> > > > > > > > > > > > > cover
> > > > > > > > > > > > > >> >> this
> > > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2)
> above:
> > > > > > sometimes
> > > > > > > > > those
> > > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState
> > are
> > > > not
> > > > > > due
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > > > state
> > > > > > > > > > > > > >> >> being
> > > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe
> > users
> > > > are
> > > > > > > doing
> > > > > > > > > > > something
> > > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > > >> >> > > > > >> > wondering if we should further
> > > > > distinguish
> > > > > > > > those
> > > > > > > > > > > > > non-fatal
> > > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > > >> >> > > > > >> > a) those that are caused by
> Kafka
> > > > > itself,
> > > > > > > > e.g. a
> > > > > > > > > > > broker
> > > > > > > > > > > > > >> timed
> > > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an
> endTxn
> > > > > request
> > > > > > is
> > > > > > > > > > > received,
> > > > > > > > > > > > > >> and b)
> > > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect,
> > > causing
> > > > > the
> > > > > > > > > request
> > > > > > > > > > > to be
> > > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > > >> >> > > > > >> > me more like for case a), but
> not
> > > case
> > > > > b).
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM
> > > Boyang
> > > > > > Chen <
> > > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your
> > > > proposal
> > > > > > > makes
> > > > > > > > > > sense
> > > > > > > > > > > as
> > > > > > > > > > > > > >> well,
> > > > > > > > > > > > > >> >> on
> > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > > >> >> > > > > >> >> we could have a universally
> > agreed
> > > > > > > exception
> > > > > > > > > type
> > > > > > > > > > > to be
> > > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > > >> >> > > > > >> >> without having an extra layer
> on
> > > top
> > > > of
> > > > > > the
> > > > > > > > > > actual
> > > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will
> > be
> > > > more
> > > > > > > > > complex.
> > > > > > > > > > > Right
> > > > > > > > > > > > > >> now
> > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > > >> >> > > > > >> >> RetriableException type, if we
> > are
> > > > > going
> > > > > > to
> > > > > > > > > add a
> > > > > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException`
> > type,
> > > we
> > > > > > have
> > > > > > > to
> > > > > > > > > put
> > > > > > > > > > > this
> > > > > > > > > > > > > >> new
> > > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > > >> >> > > > > >> >> parent of the
> RetriableException,
> > > > > because
> > > > > > > not
> > > > > > > > > all
> > > > > > > > > > > > > thrown
> > > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in
> > > general
> > > > > for
> > > > > > > > > > > producer, for
> > > > > > > > > > > > > >> >> example
> > > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > > >> >> > > > > >> >> We could have an empty
> interface
> > > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions implement
> > for
> > > > > sure,
> > > > > > > but
> > > > > > > > > > it's
> > > > > > > > > > > a
> > > > > > > > > > > > > bit
> > > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions in
> > > > general.
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we
> > throw a
> > > > > > > > > > KafkaException
> > > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > > >> >> > > > > >> >> KafkaException as either fatal
> or
> > > > > > > non-fatal.
> > > > > > > > If
> > > > > > > > > > we
> > > > > > > > > > > use
> > > > > > > > > > > > > an
> > > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required
> to
> > > > > > implement
> > > > > > > > > > another
> > > > > > > > > > > > > >> bloated
> > > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > >> >> > > > > >> >> which could replace
> > KafkaException
> > > > > type,
> > > > > > as
> > > > > > > > we
> > > > > > > > > > > couldn't
> > > > > > > > > > > > > >> mark
> > > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the
> encapsulation,
> > > > > > wrapping
> > > > > > > > > means
> > > > > > > > > > we
> > > > > > > > > > > > > could
> > > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > > >> >> > > > > >> >> of affection to the producer
> > only,
> > > > > which
> > > > > > is
> > > > > > > > > > > important
> > > > > > > > > > > > > >> since
> > > > > > > > > > > > > >> >> we
> > > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > > >> >> > > > > >> >> shared exception types to
> > > implement a
> > > > > > > > > > > producer-related
> > > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > > >> >> > > > > >> >> as
> > > UnknownTopicOrPartitionException.
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM
> > > > Matthias
> > > > > > J.
> > > > > > > > Sax
> > > > > > > > > <
> > > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing
> exceptions
> > > > > makes a
> > > > > > > lot
> > > > > > > > > of
> > > > > > > > > > > sense.
> > > > > > > > > > > > > >> As I
> > > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > > >> >> > > > > >> >>> the producer internals well
> > > enough,
> > > > I
> > > > > > > cannot
> > > > > > > > > > > comment
> > > > > > > > > > > > > on
> > > > > > > > > > > > > >> the
> > > > > > > > > > > > > >> >> > > > > >> >>> categorization in detail
> though.
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we
> > > should
> > > > > > > > introduce
> > > > > > > > > > an
> > > > > > > > > > > > > >> exception
> > > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception would
> > > > > implement
> > > > > > > > > instead
> > > > > > > > > > > of
> > > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal
> > > exceptions?
> > > > > > What
> > > > > > > > > would
> > > > > > > > > > > be the
> > > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang
> Chen
> > > > > wrote:
> > > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a
> > > discussion
> > > > > > thread
> > > > > > > > for
> > > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify
> > the
> > > > > > > exception
> > > > > > > > > > > handling
> > > > > > > > > > > > > >> logic
> > > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer users
> by
> > > > > > > classifying
> > > > > > > > > > > fatal and
> > > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > > >> >> > > > > >> >>>> and throw them
> correspondingly
> > > for
> > > > > > easier
> > > > > > > > > catch
> > > > > > > > > > > and
> > > > > > > > > > > > > >> retry.
> > > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > > >> >> >
> > > > > > > > > > > > > >> >>
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >> > --
> > > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > > >> >
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >> --
> > > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > > >>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > -- Guozhang
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks Jason, my hope is that we could streamline the callback function
signature, so that in the next major release (3.0),
onCompletion(RecordMetadata metadata, SendFailure sendFailure) will become
the only API to be implemented.

On Fri, Jan 29, 2021 at 3:42 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> Do you think it's necessary to deprecate the old `onCompletion` callback? I
> was thinking there's probably no harm leaving it around. Users might not
> care about the failure type. Other than that, it looks good to me.
>
> Thanks,
> Jason
>
> On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Hey Guozhang,
> >
> > the CommitFailedException wrapping would cover all the non-fatal
> exceptions
> > as we listed out in the KIP, generally speaking any exception that could
> > recover safely by calling abortTxn should be wrapped.
> >
> > Best,
> > Boyang
> >
> > On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Hey Boyang,
> > >
> > > I think maybe there's something cracking here :) I'm just asking for
> > > clarifications that as of today, which non-fatal exceptions the newly
> > > introduced CommitFailedException would cover, and it seems to be only
> 1)
> > > unknown pid, 2) invalid pid mapping, and 3) concurrent transactions. Is
> > > that correct?
> > >
> > > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Hey Guozhang, I think TimeoutException would not be covered here as
> it
> > > > potentially has a risk of hitting an illegal state on the broker side
> > > when
> > > > the previous commit was actually successful. Users should try to
> > increase
> > > > their max.block.ms to avoid hitting the timeout as a base
> suggestion,
> > > > which
> > > > is discussed in the KIP.
> > > >
> > > > As for the umbrella exception, I agree it has values to some extent,
> > but
> > > in
> > > > terms of adoption difficulty, additional exception types usually make
> > the
> > > > error handling more complex than simplified, and we are doing our
> best
> > to
> > > > avoid compatibility issues caused by wrapping previously thrown raw
> > > > exceptions that invalidates user error handling. And I'm also not in
> > > favor
> > > > of our exception hierarchy today where all exceptions are subclasses
> of
> > > > KafkaException, and we already mixed its handling with other specific
> > > > exception types that could be caught as superclass KafkaException.
> That
> > > > could be addressed in a separate proposal if we see necessary.
> > > >
> > > > Boyang
> > > >
> > > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks Jason for the suggestion, that looks good to me too.
> > > > >
> > > > > Regarding the non-fatal exceptions wrapped as CommitFailed, I would
> > > like
> > > > to
> > > > > clarify if we would cover all the following cases: 1) timeout, 2)
> > > unknown
> > > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > > >
> > > > > BTW I think it still makes sense to use an umbrella exception in
> case
> > > in
> > > > > the future we add more non-fatal cases even though today we only
> > have a
> > > > > few.
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Jason, I agree with the proposed solution here, will
> update
> > > the
> > > > > KIP.
> > > > > >
> > > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> > jason@confluent.io
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Boyang,
> > > > > > >
> > > > > > > It seems like a reasonable suggestion. I wonder if a flag is
> > > > sufficient
> > > > > > > though. The current `Callback` documentation treats "fatal"
> > errors
> > > > from
> > > > > > the
> > > > > > > perspective of the individual message that was sent.
> > > > > > >
> > > > > > > ```
> > > > > > >     *                  Non-Retriable exceptions (fatal, the
> > message
> > > > > will
> > > > > > > never be sent):
> > > > > > > ```
> > > > > > >
> > > > > > > However, we also have fatal errors from the perspective of the
> > > > > > transaction
> > > > > > > (e.g. when the producer gets fenced). Perhaps that suggests we
> > need
> > > > > > > something richer than a boolean flag:
> > > > > > >
> > > > > > > At a high level, I think the following cases are possible:
> > > > > > >
> > > > > > > - message rejected (e.g. message too large, invalid topic)
> > > > > > > - delivery failed after retries/delivery timeout (e.g. timeout,
> > crc
> > > > > > error,
> > > > > > > not enough replicas)
> > > > > > > - transaction failed (e.g. producer fenced, invalid transaction
> > > > state)
> > > > > > >
> > > > > > > Perhaps instead we can have a type like the following:
> > > > > > >
> > > > > > > class SendFailure {
> > > > > > >   FailureType failureType;
> > > > > > >   Exception cause;
> > > > > > > }
> > > > > > >
> > > > > > > enum FailureType {
> > > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > > > > }
> > > > > > >
> > > > > > > (Not married to any of these names, just a starting point.)
> > > > > > >
> > > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > > >
> > > > > > > default void onCompletion(RecordMetadata metadata, SendFailure
> > > > > failure) {
> > > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > > }
> > > > > > >
> > > > > > > This would give streams and other applications enough
> information
> > > to
> > > > > know
> > > > > > > whether the message can be retried and whether the transaction
> > can
> > > be
> > > > > > > aborted.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > -Jason
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks Jason for the thoughts.
> > > > > > > >
> > > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > > jason@confluent.io
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Boyang,
> > > > > > > > >
> > > > > > > > > Thanks for the iterations here. I think this is something
> we
> > > > should
> > > > > > > have
> > > > > > > > > done a long time ago. It sounds like there are two API
> > changes
> > > > > here:
> > > > > > > > >
> > > > > > > > > 1. We are introducing the `CommitFailedException` to wrap
> > > > abortable
> > > > > > > > > errors that are raised from `commitTransaction`. This
> sounds
> > > fine
> > > > > to
> > > > > > > me.
> > > > > > > > As
> > > > > > > > > far as I know, the only case we might need this is when we
> > add
> > > > > > support
> > > > > > > to
> > > > > > > > > let producers recover from coordinator timeouts. Are there
> > any
> > > > > > others?
> > > > > > > > >
> > > > > > > > > I think the purpose here is to ensure non-fatal exceptions
> > are
> > > > > > unified
> > > > > > > > under the same
> > > > > > > > exception umbrella, to make the proceeding to abort any
> ongoing
> > > > > > > transaction
> > > > > > > > much easier.
> > > > > > > > I don't think `coordinator timeouts` is the only case to
> > recover
> > > > > here,
> > > > > > > > since we have other
> > > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > > >
> > > > > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > > > > `KafkaException`.
> > > > > > > > > The motivation for this is less clear to me and it doesn't
> > look
> > > > > like
> > > > > > > the
> > > > > > > > > example from the KIP depends on it. My concern here is
> > > > > compatibility.
> > > > > > > > > Currently we have the following documentation for the
> > > `Callback`
> > > > > API:
> > > > > > > > >
> > > > > > > > > ```
> > > > > > > > >      *                  Non-Retriable exceptions (fatal,
> the
> > > > > message
> > > > > > > will
> > > > > > > > > never be sent):
> > > > > > > > >      *
> > > > > > > > >      *                  InvalidTopicException
> > > > > > > > >      *                  OffsetMetadataTooLargeException
> > > > > > > > >      *                  RecordBatchTooLargeException
> > > > > > > > >      *                  RecordTooLargeException
> > > > > > > > >      *                  UnknownServerException
> > > > > > > > >      *                  UnknownProducerIdException
> > > > > > > > >      *                  InvalidProducerEpochException
> > > > > > > > >      *
> > > > > > > > >      *                  Retriable exceptions (transient,
> may
> > be
> > > > > > covered
> > > > > > > > by
> > > > > > > > > increasing #.retries):
> > > > > > > > >      *
> > > > > > > > >      *                  CorruptRecordException
> > > > > > > > >      *                  InvalidMetadataException
> > > > > > > > >      *
> NotEnoughReplicasAfterAppendException
> > > > > > > > >      *                  NotEnoughReplicasException
> > > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > > >      *                  TimeoutException
> > > > > > > > >      *                  UnknownTopicOrPartitionException
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > If we wrap all the retriable exceptions documented here as
> > > > > > > > > `KafkaException`, wouldn't that break any error handling
> that
> > > > users
> > > > > > > might
> > > > > > > > > already have? it's gonna introduce a compatibility issue.
> > > > > > > > >
> > > > > > > > > The original intention was to simplify `send` callback
> error
> > > > > handling
> > > > > > > by
> > > > > > > > doing exception wrapping, as on Streams level
> > > > > > > > we have to prepare an exhausting list of exceptions to catch
> as
> > > > > fatal,
> > > > > > > and
> > > > > > > > the same lengthy list to catch as
> > > > > > > > non-fatal. It would be much easier if we got `hints` from the
> > > > > callback.
> > > > > > > > However,
> > > > > > > > I agree there is a compatibility concern, what about
> > deprecating
> > > > the
> > > > > > > > existing:
> > > > > > > >
> > > > > > > > void onCompletion(RecordMetadata metadata, Exception
> exception)
> > > > > > > >
> > > > > > > > and replace it with:
> > > > > > > >
> > > > > > > > default void onCompletion(RecordMetadata metadata, Exception
> > > > > exception,
> > > > > > > > boolean isFatal) {
> > > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > > }
> > > > > > > >
> > > > > > > > to make sure new users get the benefit of understanding the
> > > > fatality
> > > > > > > based
> > > > > > > > on the info presented by the producer?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > > Jason
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> > info@hiringuru.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Why  we are receiving all emails kindly remove us from
> > > > > > > > > > dev@kafka.apache.org we don't want to receive emails
> > > anymore.
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks Boyang, yes I think I was confused about the
> > > different
> > > > > > > > handling
> > > > > > > > > of
> > > > > > > > > > > two abortTxn calls, and now I get it was not
> > intentional. I
> > > > > > think I
> > > > > > > > do
> > > > > > > > > > not
> > > > > > > > > > > have more concerns.
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thanks for the clarification Guozhang, I got your
> point
> > > > that
> > > > > we
> > > > > > > > want
> > > > > > > > > to
> > > > > > > > > > > > have a consistent handling of fatal exceptions being
> > > thrown
> > > > > > from
> > > > > > > > the
> > > > > > > > > > > > abortTxn. I modified the current template to move the
> > > fatal
> > > > > > > > exception
> > > > > > > > > > > > try-catch outside of the processing loop to make sure
> > we
> > > > > could
> > > > > > > get
> > > > > > > > a
> > > > > > > > > > chance
> > > > > > > > > > > > to close consumer/producer modules. Let me know what
> > you
> > > > > think.
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > > Boyang
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > My understanding is that abortTransaction would
> only
> > > > throw
> > > > > > when
> > > > > > > > the
> > > > > > > > > > > > > producer is in fatal state. For CommitFailed, the
> > > > producer
> > > > > > > should
> > > > > > > > > > still
> > > > > > > > > > > > be
> > > > > > > > > > > > > in the abortable error state, so that
> > abortTransaction
> > > > call
> > > > > > > would
> > > > > > > > > not
> > > > > > > > > > > > throw.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > > > > > > wangguoz@gmail.com
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Or are you going to maintain some internal state
> > such
> > > > > that,
> > > > > > > the
> > > > > > > > > > > > >> `abortTransaction` in the catch block would never
> > > throw
> > > > > > again?
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > >> wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > I've also thought about this (i.e. using
> > > CommitFailed
> > > > > for
> > > > > > > all
> > > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > > > > (CommitFailed)
> > > > > > > > > block,
> > > > > > > > > > > > what
> > > > > > > > > > > > >> > would happen if the
> `producer.abortTransaction();`
> > > > > throws
> > > > > > > > again?
> > > > > > > > > > > > should
> > > > > > > > > > > > >> > that be captured as a fatal and cause the client
> > to
> > > > > close
> > > > > > > > again.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > ...
> > > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > > >> >         // Transaction commit failed with
> > abortable
> > > > > error,
> > > > > > > > user
> > > > > > > > > > could
> > > > > > > > > > > > >> reset
> > > > > > > > > > > > >> >         // the application state and resume
> with a
> > > new
> > > > > > > > > > transaction.
> > > > > > > > > > > > The
> > > > > > > > > > > > >> > root
> > > > > > > > > > > > >> >         // cause was wrapped in the thrown
> > > exception.
> > > > > > > > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > >> >         try {
> > > > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > > >> >         }
> > > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > > >> >     }
> > > > > > > > > > > > >> > ...
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> Jason and I were discussing the new API offline
> > and
> > > > > > decided
> > > > > > > > to
> > > > > > > > > > take
> > > > > > > > > > > > >> >> another
> > > > > > > > > > > > >> >> approach. Firstly, the reason not to invent a
> new
> > > API
> > > > > > with
> > > > > > > > > > returned
> > > > > > > > > > > > >> >> boolean
> > > > > > > > > > > > >> >> flag is for compatibility consideration, since
> > old
> > > > EOS
> > > > > > code
> > > > > > > > > > would not
> > > > > > > > > > > > >> know
> > > > > > > > > > > > >> >> that a given transaction commit was failed
> > > internally
> > > > > as
> > > > > > > they
> > > > > > > > > > don't
> > > > > > > > > > > > >> listen
> > > > > > > > > > > > >> >> to the output flag. Our proposed alternative
> > > solution
> > > > > is
> > > > > > to
> > > > > > > > let
> > > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > > >> >> throw CommitFailedException whenever the commit
> > > > failed
> > > > > > with
> > > > > > > > > > non-fatal
> > > > > > > > > > > > >> >> exception*, so that on the caller side the
> > handling
> > > > > logic
> > > > > > > > > > becomes:
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> try {
> > > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > > >> >>
> > >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > > > > >> >>         }
> > > > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > > > >> >>         // Transaction commit failed with
> > abortable
> > > > > > error,
> > > > > > > > user
> > > > > > > > > > could
> > > > > > > > > > > > >> >> reset
> > > > > > > > > > > > >> >>         // the application state and resume
> with
> > a
> > > > new
> > > > > > > > > > transaction.
> > > > > > > > > > > > The
> > > > > > > > > > > > >> >> root
> > > > > > > > > > > > >> >>         // cause was wrapped in the thrown
> > > exception.
> > > > > > > > > > > > >> >>
>  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > > >> >>     }
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> This approach looks cleaner as all exception
> > types
> > > > > other
> > > > > > > than
> > > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > > >> >> will doom to be fatal, which is very easy to
> > adopt
> > > > for
> > > > > > > users.
> > > > > > > > > In
> > > > > > > > > > the
> > > > > > > > > > > > >> >> meantime, we still maintain the commitTxn
> > behavior
> > > to
> > > > > > throw
> > > > > > > > > > instead
> > > > > > > > > > > > of
> > > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> In addition, we decided to drop the
> > recommendation
> > > to
> > > > > > > handle
> > > > > > > > > > > > >> >> TimeoutException and leave it to the users to
> > make
> > > > the
> > > > > > > call.
> > > > > > > > > The
> > > > > > > > > > > > >> downside
> > > > > > > > > > > > >> >> for blindly calling abortTxn upon timeout is
> that
> > > we
> > > > > > could
> > > > > > > > > > result in
> > > > > > > > > > > > an
> > > > > > > > > > > > >> >> illegal state when the commit was already
> > > successful
> > > > on
> > > > > > the
> > > > > > > > > > broker
> > > > > > > > > > > > >> >> side. Without a good guarantee on the outcome,
> > > > > > > > overcomplicating
> > > > > > > > > > the
> > > > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> Let me know your thoughts on the new approach
> > here,
> > > > > thank
> > > > > > > > you!
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> Best,
> > > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang
> <
> > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >> > Thanks for your clarification on 2)/3), that
> > > makes
> > > > > > sense.
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen
> <
> > > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied
> > inline.
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang
> > Wang <
> > > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it
> again
> > > and
> > > > > > have
> > > > > > > > the
> > > > > > > > > > > > >> following
> > > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn
> > does
> > > > not
> > > > > > > throw
> > > > > > > > > any
> > > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > > >> >> > > > exception, and instead we rely on the
> > > > subsequent
> > > > > > > > beginTxn
> > > > > > > > > > call
> > > > > > > > > > > > to
> > > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > > >> >> > > it
> > > > > > > > > > > > >> >> > > > may violate some callers with a pattern
> > that
> > > > > > relying
> > > > > > > on
> > > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > > >> >> > > > succeed to make some non-rollback
> > operations.
> > > > For
> > > > > > > > > example:
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit the
> > DB
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > > > > non-rollback
> > > > > > > > > > operation,
> > > > > > > > > > > > and
> > > > > > > > > > > > >> >> users
> > > > > > > > > > > > >> >> > > may
> > > > > > > > > > > > >> >> > > > just rely on commitTxn to return without
> > > error
> > > > to
> > > > > > > make
> > > > > > > > > this
> > > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > > >> >> > > > call. Of course we can just claim this
> > > pattern
> > > > is
> > > > > > not
> > > > > > > > > > > > legitimate
> > > > > > > > > > > > >> >> and is
> > > > > > > > > > > > >> >> > > not
> > > > > > > > > > > > >> >> > > > the right way of doing things, but many
> > users
> > > > may
> > > > > > > > > naturally
> > > > > > > > > > > > adopt
> > > > > > > > > > > > >> >> this
> > > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > So maybe we should still let commitTxn
> also
> > > > throw
> > > > > > > > > non-fatal
> > > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > > >> >> > > in
> > > > > > > > > > > > >> >> > > > which case we would then call abortTxn
> > again.
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > > >> >> > > >     } catch (...) {        // enumerate
> all
> > > > fatal
> > > > > > > > > > exceptions
> > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > Thank makes me think, the alternative we
> > have
> > > > > > > discussed
> > > > > > > > > > offline
> > > > > > > > > > > > >> may
> > > > > > > > > > > > >> >> be
> > > > > > > > > > > > >> >> > > > better: let commitTxn() to return a
> boolean
> > > > flag.
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > * If it returns true, it means the commit
> > > > > > succeeded.
> > > > > > > > > Users
> > > > > > > > > > can
> > > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > > >> >> > > > continue and do any external non-rollback
> > > > > > operations
> > > > > > > if
> > > > > > > > > > they
> > > > > > > > > > > > >> like.
> > > > > > > > > > > > >> >> > > > * If it returns false, it means the
> commit
> > > > failed
> > > > > > > with
> > > > > > > > > > > > non-fatal
> > > > > > > > > > > > >> >> error
> > > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > > >> >> > > > the txn has been aborted*. Users do not
> > need
> > > to
> > > > > > call
> > > > > > > > > > abortTxn
> > > > > > > > > > > > >> again.
> > > > > > > > > > > > >> >> > > > * If it throws, then it means fatal
> errors.
> > > > Users
> > > > > > > > should
> > > > > > > > > > shut
> > > > > > > > > > > > >> down
> > > > > > > > > > > > >> >> the
> > > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > > >> >> > > >         // reset offsets, rollback
> > > operations,
> > > > > etc
> > > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > > >> >> > > >    // reset offsets, rollback operations,
> > etc
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > Of course, if we want to have better
> > > visibility
> > > > > > into
> > > > > > > > what
> > > > > > > > > > > > caused
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > > >> >> > > > to fail and txn to abort. We can let the
> > > return
> > > > > > type
> > > > > > > be
> > > > > > > > > an
> > > > > > > > > > enum
> > > > > > > > > > > > >> >> instead
> > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > >> >> > > > a flag. But my main idea is to still let
> > the
> > > > > > > commitTxn
> > > > > > > > be
> > > > > > > > > > the
> > > > > > > > > > > > >> final
> > > > > > > > > > > > >> >> > point
> > > > > > > > > > > > >> >> > > > users can tell whether this txn succeeded
> > or
> > > > not,
> > > > > > > > instead
> > > > > > > > > > of
> > > > > > > > > > > > >> >> relying on
> > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > I agree that adding a boolean flag is
> > indeed
> > > > > useful
> > > > > > > in
> > > > > > > > > this
> > > > > > > > > > > > case.
> > > > > > > > > > > > >> >> Will
> > > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to
> > throw
> > > > > fatal
> > > > > > > > > > exception
> > > > > > > > > > > > in
> > > > > > > > > > > > >> >> raw"
> > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > >> >> > > > sure what do you mean by "throw" here.
> Are
> > > you
> > > > > > > > proposing
> > > > > > > > > > the
> > > > > > > > > > > > >> >> callback
> > > > > > > > > > > > >> >> > > would
> > > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal
> exceptions
> > > when
> > > > > > > > triggered
> > > > > > > > > > > > without
> > > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is
> > to
> > > > make
> > > > > > the
> > > > > > > > end
> > > > > > > > > > > > user's
> > > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the claim
> > > > regarding
> > > > > > the
> > > > > > > > > > callback
> > > > > > > > > > > > >> that
> > > > > > > > > > > > >> >> "In
> > > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > > >> >> > > > setup, it is not required to handle the
> > > > > exception".
> > > > > > > Are
> > > > > > > > > you
> > > > > > > > > > > > >> >> proposing
> > > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to handle
> > any
> > > > > > > exceptions
> > > > > > > > > if
> > > > > > > > > > EOS
> > > > > > > > > > > > is
> > > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > > >> >> > > > in the callback anymore, but just let
> > > > commitTxn()
> > > > > > > > itself
> > > > > > > > > to
> > > > > > > > > > > > fail
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> >> be
> > > > > > > > > > > > >> >> > > > notified about the problem? Personally I
> > > think
> > > > in
> > > > > > > > Streams
> > > > > > > > > > we
> > > > > > > > > > > > >> should
> > > > > > > > > > > > >> >> > just
> > > > > > > > > > > > >> >> > > > make the handling logic of the callback
> to
> > be
> > > > > > > > consistent
> > > > > > > > > > > > >> regardless
> > > > > > > > > > > > >> >> of
> > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > >> >> > > > EOS settings (today we have different
> logic
> > > > > > depending
> > > > > > > > on
> > > > > > > > > > this
> > > > > > > > > > > > >> logic,
> > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > My idea originates from the claim on send
> > > API:
> > > > > > > > > > > > >> >> > > "When used as part of a transaction, it is
> > not
> > > > > > > necessary
> > > > > > > > to
> > > > > > > > > > > > define
> > > > > > > > > > > > >> a
> > > > > > > > > > > > >> >> > > callback or check the result of the future
> > in
> > > > > order
> > > > > > to
> > > > > > > > > > detect
> > > > > > > > > > > > >> errors
> > > > > > > > > > > > >> >> > from
> > > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > > >> >> > > My understanding is that for EOS, the
> > exception
> > > > > will
> > > > > > be
> > > > > > > > > > detected
> > > > > > > > > > > > by
> > > > > > > > > > > > >> >> > calling
> > > > > > > > > > > > >> >> > > transactional APIs either way, so it is a
> > > > duplicate
> > > > > > > > > handling
> > > > > > > > > > to
> > > > > > > > > > > > >> track
> > > > > > > > > > > > >> >> > > all the sendExceptions in RecordCollector.
> > > > > However, I
> > > > > > > > > looked
> > > > > > > > > > up
> > > > > > > > > > > > >> >> > > sendException is being used today as
> follow:
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for
> > any
> > > > > > default
> > > > > > > > or
> > > > > > > > > > > > >> customized
> > > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > > >> >> > > 2. Stop collecting offset info or new
> > > exceptions
> > > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in close(),
> > > > flush()
> > > > > > or
> > > > > > > > new
> > > > > > > > > > send()
> > > > > > > > > > > > >> >> calls
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > To my understanding, we should still honor
> > the
> > > > > > > commitment
> > > > > > > > > to
> > > > > > > > > > #1
> > > > > > > > > > > > for
> > > > > > > > > > > > >> >> any
> > > > > > > > > > > > >> >> > > user customized implementation. The #2 does
> > not
> > > > > > really
> > > > > > > > > > affect our
> > > > > > > > > > > > >> >> > decision
> > > > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable as
> it
> > > > could
> > > > > > > help
> > > > > > > > us
> > > > > > > > > > fail
> > > > > > > > > > > > >> fast
> > > > > > > > > > > > >> >> in
> > > > > > > > > > > > >> >> > > new send() instead of waiting to later
> stage
> > of
> > > > > > > > processing.
> > > > > > > > > > In
> > > > > > > > > > > > that
> > > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > > >> >> > > I agree we should continue to record send
> > > > > exceptions
> > > > > > > even
> > > > > > > > > > under
> > > > > > > > > > > > EOS
> > > > > > > > > > > > >> >> case
> > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > >> >> > > ensure the strength of stream side Producer
> > > > logic.
> > > > > On
> > > > > > > the
> > > > > > > > > > safer
> > > > > > > > > > > > >> side,
> > > > > > > > > > > > >> >> we
> > > > > > > > > > > > >> >> > no
> > > > > > > > > > > > >> >> > > longer need to wrap certain fatal
> exceptions
> > > like
> > > > > > > > > > ProducerFenced
> > > > > > > > > > > > as
> > > > > > > > > > > > >> >> > > TaskMigrated, since they should not crash
> the
> > > > > stream
> > > > > > > > thread
> > > > > > > > > > if
> > > > > > > > > > > > >> thrown
> > > > > > > > > > > > >> >> in
> > > > > > > > > > > > >> >> > > raw format, once we adopt the new
> processing
> > > > model
> > > > > in
> > > > > > > the
> > > > > > > > > > send
> > > > > > > > > > > > >> phase.
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang
> > Chen <
> > > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so far.
> I
> > > have
> > > > > > > > polished
> > > > > > > > > > the
> > > > > > > > > > > > KIP
> > > > > > > > > > > > >> >> after
> > > > > > > > > > > > >> >> > > > > offline discussion with some folks
> > working
> > > on
> > > > > EOS
> > > > > > > to
> > > > > > > > > > make the
> > > > > > > > > > > > >> >> > exception
> > > > > > > > > > > > >> >> > > > > handling more lightweight. The
> essential
> > > > change
> > > > > > is
> > > > > > > > that
> > > > > > > > > > we
> > > > > > > > > > > > are
> > > > > > > > > > > > >> not
> > > > > > > > > > > > >> >> > > > > inventing a new intermediate exception
> > > type,
> > > > > but
> > > > > > > > > instead
> > > > > > > > > > > > >> >> separating a
> > > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > > > This way for any exception thrown from
> > > phase
> > > > 1,
> > > > > > > will
> > > > > > > > be
> > > > > > > > > > an
> > > > > > > > > > > > >> >> indicator
> > > > > > > > > > > > >> >> > in
> > > > > > > > > > > > >> >> > > > > phase 2 whether we should do commit or
> > > abort,
> > > > > and
> > > > > > > > from
> > > > > > > > > > now on
> > > > > > > > > > > > >> >> > > > > `commitTransaction` should only throw
> > fatal
> > > > > > > > exceptions,
> > > > > > > > > > > > >> similar to
> > > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > > > KafkaException
> > > > > > > caught
> > > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > > >> >> commit
> > > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > > >> >> > > > > will be definitely fatal to crash the
> > app.
> > > > For
> > > > > > more
> > > > > > > > > > advanced
> > > > > > > > > > > > >> users
> > > > > > > > > > > > >> >> > such
> > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > >> >> > > > > Streams, we have the ability to further
> > > wrap
> > > > > > > selected
> > > > > > > > > > types
> > > > > > > > > > > > of
> > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > >> >> > > > > exceptions to trigger task migration if
> > > > > > necessary.
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > > > More details in the KIP, feel free to
> > take
> > > > > > another
> > > > > > > > > look,
> > > > > > > > > > > > >> thanks!
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang
> > > Chen <
> > > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno
> > > > Cadonna
> > > > > <
> > > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not know
> the
> > > > > producer
> > > > > > > > > > internal
> > > > > > > > > > > > well
> > > > > > > > > > > > >> >> > enough
> > > > > > > > > > > > >> >> > > to
> > > > > > > > > > > > >> >> > > > > >> comment on the categorization.
> > However,
> > > I
> > > > > > think
> > > > > > > > > > having a
> > > > > > > > > > > > >> super
> > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that
> > encodes
> > > if
> > > > > an
> > > > > > > > > > exception is
> > > > > > > > > > > > >> >> fatal
> > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > > >> >> > > > > >> is cleaner, better understandable
> and
> > > less
> > > > > > > > > > error-prone,
> > > > > > > > > > > > >> because
> > > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal
> exception
> > > in
> > > > > > future
> > > > > > > > you
> > > > > > > > > > just
> > > > > > > > > > > > >> need
> > > > > > > > > > > > >> >> to
> > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > >> >> > > > > >> about letting it inherit from the
> > super
> > > > > > > exception
> > > > > > > > > and
> > > > > > > > > > all
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> rest
> > > > > > > > > > > > >> >> > > of
> > > > > > > > > > > > >> >> > > > > >> the code will just behave correctly
> > > > without
> > > > > > the
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > > wrap
> > > > > > > > > > > > >> >> the
> > > > > > > > > > > > >> >> > new
> > > > > > > > > > > > >> >> > > > > >> exception into another exception
> each
> > > time
> > > > > it
> > > > > > is
> > > > > > > > > > thrown
> > > > > > > > > > > > >> (maybe
> > > > > > > > > > > > >> >> it
> > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > >> >> > > > > >> thrown at different location in the
> > > code).
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> As far as I understand the following
> > > > > statement
> > > > > > > > from
> > > > > > > > > > your
> > > > > > > > > > > > >> >> previous
> > > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > > >> >> > > > > >> is the reason that currently such a
> > > super
> > > > > > > > exception
> > > > > > > > > > is not
> > > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> "Right now we have
> RetriableException
> > > > type,
> > > > > if
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > > > going
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> >> > add a
> > > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException` type,
> we
> > > have
> > > > > to
> > > > > > > put
> > > > > > > > > > this new
> > > > > > > > > > > > >> >> > interface
> > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > >> >> > > > > >> the parent of the
> RetriableException,
> > > > > because
> > > > > > > not
> > > > > > > > > all
> > > > > > > > > > > > thrown
> > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in
> general
> > > for
> > > > > > > > producer"
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> In the list of exceptions in your
> > KIP, I
> > > > > found
> > > > > > > > > > non-fatal
> > > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > >> >> > > > > >> do not inherit from
> > RetriableException.
> > > I
> > > > > > guess
> > > > > > > > > those
> > > > > > > > > > are
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> ones
> > > > > > > > > > > > >> >> > > you
> > > > > > > > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> All of those exceptions are
> non-fatal
> > > and
> > > > do
> > > > > > not
> > > > > > > > > > inherit
> > > > > > > > > > > > >> from
> > > > > > > > > > > > >> >> > > > > >> RetriableException. Is there a
> reason
> > > for
> > > > > > that?
> > > > > > > If
> > > > > > > > > > they
> > > > > > > > > > > > >> >> depended
> > > > > > > > > > > > >> >> > > from
> > > > > > > > > > > > >> >> > > > > >> RetriableException we would be a bit
> > > > closer
> > > > > > to a
> > > > > > > > > super
> > > > > > > > > > > > >> >> exception I
> > > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> The reason is that sender may catch
> > > those
> > > > > > > > exceptions
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently
> does
> > > > > infinite
> > > > > > > > > > > > >> >> > > > > > retries on RetriableException. To
> make
> > > sure
> > > > > we
> > > > > > > > could
> > > > > > > > > > > > trigger
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> > > > > > abortTransaction() by catching
> > non-fatal
> > > > > thrown
> > > > > > > > > > > > >> >> > > > > > exceptions and reinitialize the txn
> > > state,
> > > > we
> > > > > > > chose
> > > > > > > > > > not to
> > > > > > > > > > > > >> let
> > > > > > > > > > > > >> >> > those
> > > > > > > > > > > > >> >> > > > > > exceptions inherit RetriableException
> > so
> > > > that
> > > > > > > > > > > > >> >> > > > > > they won't cause infinite retry on
> > > sender.
> > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > > >> >> > > > > >> to understand that their fatality
> > > depends
> > > > on
> > > > > > the
> > > > > > > > > type
> > > > > > > > > > > > (i.e.
> > > > > > > > > > > > >> >> > > > > >> configuration) of the producer. That
> > > makes
> > > > > it
> > > > > > > > > > difficult to
> > > > > > > > > > > > >> >> have a
> > > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > > >> >> > > > > >> exception that encodes the
> > retriability
> > > as
> > > > > > > > mentioned
> > > > > > > > > > > > above.
> > > > > > > > > > > > >> >> Would
> > > > > > > > > > > > >> >> > it
> > > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > > >> >> > > > > >> possible to introduce exceptions
> that
> > > > > inherit
> > > > > > > from
> > > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > > >> >> > > > > >> and that are thrown when those
> > > exceptions
> > > > > are
> > > > > > > > caught
> > > > > > > > > > from
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > > >> >> > > > > >> and the type of the producer is such
> > > that
> > > > > the
> > > > > > > > > > exceptions
> > > > > > > > > > > > are
> > > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> Yea, I think in general the
> exception
> > > type
> > > > > > > mixing
> > > > > > > > is
> > > > > > > > > > > > >> difficult
> > > > > > > > > > > > >> >> to
> > > > > > > > > > > > >> >> > > get
> > > > > > > > > > > > >> >> > > > > > them all right. I have already
> proposed
> > > > > another
> > > > > > > > > > solution
> > > > > > > > > > > > >> based
> > > > > > > > > > > > >> >> on
> > > > > > > > > > > > >> >> > my
> > > > > > > > > > > > >> >> > > > > > offline discussion with some folks
> > > working
> > > > on
> > > > > > EOS
> > > > > > > > > > > > >> >> > > > > > to make the handling more
> > straightforward
> > > > for
> > > > > > end
> > > > > > > > > users
> > > > > > > > > > > > >> without
> > > > > > > > > > > > >> >> the
> > > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang
> > wrote:
> > > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I
> > > made a
> > > > > > pass
> > > > > > > > over
> > > > > > > > > > the
> > > > > > > > > > > > >> list
> > > > > > > > > > > > >> >> and
> > > > > > > > > > > > >> >> > > here
> > > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > 0) Although this is not part of
> the
> > > > public
> > > > > > > API,
> > > > > > > > I
> > > > > > > > > > think
> > > > > > > > > > > > we
> > > > > > > > > > > > >> >> > should
> > > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e.
> > user
> > > > can
> > > > > > > always
> > > > > > > > > > call
> > > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are
> > indeed
> > > > > > > supported.
> > > > > > > > > > E.g. if
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> txn
> > > > > > > > > > > > >> >> > > is
> > > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > > >> >> > > > > >> > aborted by the broker side, then
> > users
> > > > can
> > > > > > > still
> > > > > > > > > > call
> > > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > > >> >> > > which
> > > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > > >> >> > > > > >> > not throw another exception but
> just
> > > be
> > > > > > > treated
> > > > > > > > > as a
> > > > > > > > > > > > >> no-op.
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > 1)
> > *ConcurrentTransactionsException*:
> > > I
> > > > > > think
> > > > > > > > this
> > > > > > > > > > error
> > > > > > > > > > > > >> can
> > > > > > > > > > > > >> >> > also
> > > > > > > > > > > > >> >> > > be
> > > > > > > > > > > > >> >> > > > > >> > returned but not documented yet.
> > This
> > > > > should
> > > > > > > be
> > > > > > > > a
> > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > >> >> > error.
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*:
> this
> > > > error
> > > > > is
> > > > > > > > > > returned
> > > > > > > > > > > > from
> > > > > > > > > > > > >> >> > broker
> > > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > > >> >> > > > > >> > state transition failed (e.g. it
> is
> > > > trying
> > > > > > to
> > > > > > > > > > transit to
> > > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > > > > prepare-commit).
> > > > > > > > > This
> > > > > > > > > > > > error
> > > > > > > > > > > > >> >> could
> > > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > > >> >> > > > > >> > a bug on the client internal code
> or
> > > the
> > > > > > > broker
> > > > > > > > > > code,
> > > > > > > > > > > > OR a
> > > > > > > > > > > > >> >> user
> > > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > > ConcurrentTransactionsException,
> > > > > > > > > > i.e.
> > > > > > > > > > > > if
> > > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > > >> >> > > > > >> > these exceptions would only be
> > > returned
> > > > if
> > > > > > > users
> > > > > > > > > > try to
> > > > > > > > > > > > do
> > > > > > > > > > > > >> >> > > something
> > > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after
> a
> > > > > > commitTxn,
> > > > > > > > > etc.
> > > > > > > > > > So
> > > > > > > > > > > > I'm
> > > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > > >> >> > > > > >> > should be a non-fatal error
> instead
> > > of a
> > > > > > fatal
> > > > > > > > > > error,
> > > > > > > > > > > > >> wdyt?
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i
> > "indicates
> > > > > fatal
> > > > > > > > > > > > transactional
> > > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit
> conflicting
> > > with
> > > > > the
> > > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess
> > your
> > > > > > proposal
> > > > > > > > is
> > > > > > > > > > that
> > > > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would
> be
> > > > > treated
> > > > > > > > > either
> > > > > > > > > > as
> > > > > > > > > > > > >> fatal
> > > > > > > > > > > > >> >> > with
> > > > > > > > > > > > >> >> > > > > >> > transactional producer, or
> non-fatal
> > > > with
> > > > > > > > > idempotent
> > > > > > > > > > > > >> >> producer,
> > > > > > > > > > > > >> >> > is
> > > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > > >> >> > > > > >> > right? If the producer is only
> > > > configured
> > > > > > with
> > > > > > > > > > > > idempotency
> > > > > > > > > > > > >> >> but
> > > > > > > > > > > > >> >> > not
> > > > > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > > >> >> > > for
> > > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be
> confusing
> > > > since
> > > > > > > users
> > > > > > > > > are
> > > > > > > > > > not
> > > > > > > > > > > > >> using
> > > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we always
> > throw
> > > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the
> > > producer
> > > > is
> > > > > > > > > > configured,
> > > > > > > > > > > > and
> > > > > > > > > > > > >> >> let
> > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > > >> >> > > > > >> > decide how to handle it based on
> > > whether
> > > > > it
> > > > > > is
> > > > > > > > > only
> > > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > > >> >> > or
> > > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the
> > > > `send()`
> > > > > > > > > callback /
> > > > > > > > > > > > >> future
> > > > > > > > > > > > >> >> can
> > > > > > > > > > > > >> >> > > also
> > > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I think
> this
> > > KIP
> > > > > > > should
> > > > > > > > > also
> > > > > > > > > > > > cover
> > > > > > > > > > > > >> >> this
> > > > > > > > > > > > >> >> > > API
> > > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above:
> > > > > sometimes
> > > > > > > > those
> > > > > > > > > > > > >> non-fatal
> > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState
> are
> > > not
> > > > > due
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > > > state
> > > > > > > > > > > > >> >> being
> > > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe
> users
> > > are
> > > > > > doing
> > > > > > > > > > something
> > > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > > >> >> > So
> > > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > > >> >> > > > > >> > wondering if we should further
> > > > distinguish
> > > > > > > those
> > > > > > > > > > > > non-fatal
> > > > > > > > > > > > >> >> > errors
> > > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > > >> >> > > > > >> > a) those that are caused by Kafka
> > > > itself,
> > > > > > > e.g. a
> > > > > > > > > > broker
> > > > > > > > > > > > >> timed
> > > > > > > > > > > > >> >> > out
> > > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn
> > > > request
> > > > > is
> > > > > > > > > > received,
> > > > > > > > > > > > >> and b)
> > > > > > > > > > > > >> >> > the
> > > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect,
> > causing
> > > > the
> > > > > > > > request
> > > > > > > > > > to be
> > > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > > >> >> > > > > >> > me more like for case a), but not
> > case
> > > > b).
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM
> > Boyang
> > > > > Chen <
> > > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your
> > > proposal
> > > > > > makes
> > > > > > > > > sense
> > > > > > > > > > as
> > > > > > > > > > > > >> well,
> > > > > > > > > > > > >> >> on
> > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > > >> >> > > > > >> >> we could have a universally
> agreed
> > > > > > exception
> > > > > > > > type
> > > > > > > > > > to be
> > > > > > > > > > > > >> >> caught
> > > > > > > > > > > > >> >> > by
> > > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > > >> >> > > > > >> >> without having an extra layer on
> > top
> > > of
> > > > > the
> > > > > > > > > actual
> > > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > > >> >> > I
> > > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will
> be
> > > more
> > > > > > > > complex.
> > > > > > > > > > Right
> > > > > > > > > > > > >> now
> > > > > > > > > > > > >> >> we
> > > > > > > > > > > > >> >> > > have
> > > > > > > > > > > > >> >> > > > > >> >> RetriableException type, if we
> are
> > > > going
> > > > > to
> > > > > > > > add a
> > > > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException`
> type,
> > we
> > > > > have
> > > > > > to
> > > > > > > > put
> > > > > > > > > > this
> > > > > > > > > > > > >> new
> > > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > > >> >> > > > > >> >> parent of the RetriableException,
> > > > because
> > > > > > not
> > > > > > > > all
> > > > > > > > > > > > thrown
> > > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in
> > general
> > > > for
> > > > > > > > > > producer, for
> > > > > > > > > > > > >> >> example
> > > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > > >> >> > to
> > > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions implement
> for
> > > > sure,
> > > > > > but
> > > > > > > > > it's
> > > > > > > > > > a
> > > > > > > > > > > > bit
> > > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions in
> > > general.
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we
> throw a
> > > > > > > > > KafkaException
> > > > > > > > > > > > >> wrapping
> > > > > > > > > > > > >> >> > > another
> > > > > > > > > > > > >> >> > > > > >> >> KafkaException as either fatal or
> > > > > > non-fatal.
> > > > > > > If
> > > > > > > > > we
> > > > > > > > > > use
> > > > > > > > > > > > an
> > > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required to
> > > > > implement
> > > > > > > > > another
> > > > > > > > > > > > >> bloated
> > > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > >> >> > > > > >> >> which could replace
> KafkaException
> > > > type,
> > > > > as
> > > > > > > we
> > > > > > > > > > couldn't
> > > > > > > > > > > > >> mark
> > > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation,
> > > > > wrapping
> > > > > > > > means
> > > > > > > > > we
> > > > > > > > > > > > could
> > > > > > > > > > > > >> >> limit
> > > > > > > > > > > > >> >> > > the
> > > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > > >> >> > > > > >> >> of affection to the producer
> only,
> > > > which
> > > > > is
> > > > > > > > > > important
> > > > > > > > > > > > >> since
> > > > > > > > > > > > >> >> we
> > > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > > >> >> > > > > >> >> shared exception types to
> > implement a
> > > > > > > > > > producer-related
> > > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > > >> >> > > > > >> >> as
> > UnknownTopicOrPartitionException.
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM
> > > Matthias
> > > > > J.
> > > > > > > Sax
> > > > > > > > <
> > > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions
> > > > makes a
> > > > > > lot
> > > > > > > > of
> > > > > > > > > > sense.
> > > > > > > > > > > > >> As I
> > > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > > >> >> > > > > >> >>> the producer internals well
> > enough,
> > > I
> > > > > > cannot
> > > > > > > > > > comment
> > > > > > > > > > > > on
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we
> > should
> > > > > > > introduce
> > > > > > > > > an
> > > > > > > > > > > > >> exception
> > > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception would
> > > > implement
> > > > > > > > instead
> > > > > > > > > > of
> > > > > > > > > > > > >> >> creating a
> > > > > > > > > > > > >> >> > > new
> > > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal
> > exceptions?
> > > > > What
> > > > > > > > would
> > > > > > > > > > be the
> > > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen
> > > > wrote:
> > > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a
> > discussion
> > > > > thread
> > > > > > > for
> > > > > > > > > > > > KIP-691:
> > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify
> the
> > > > > > exception
> > > > > > > > > > handling
> > > > > > > > > > > > >> logic
> > > > > > > > > > > > >> >> > for
> > > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer users by
> > > > > > classifying
> > > > > > > > > > fatal and
> > > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > > >> >> > > > > >> >>>> and throw them correspondingly
> > for
> > > > > easier
> > > > > > > > catch
> > > > > > > > > > and
> > > > > > > > > > > > >> retry.
> > > > > > > > > > > > >> >> > Let
> > > > > > > > > > > > >> >> > > me
> > > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > > >> >> > > >
> > > > > > > > > > > > >> >> > >
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >> > --
> > > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > > >> >> >
> > > > > > > > > > > > >> >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > --
> > > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> --
> > > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Boyang,

Do you think it's necessary to deprecate the old `onCompletion` callback? I
was thinking there's probably no harm leaving it around. Users might not
care about the failure type. Other than that, it looks good to me.

Thanks,
Jason

On Thu, Jan 28, 2021 at 6:58 PM Boyang Chen <re...@gmail.com>
wrote:

> Hey Guozhang,
>
> the CommitFailedException wrapping would cover all the non-fatal exceptions
> as we listed out in the KIP, generally speaking any exception that could
> recover safely by calling abortTxn should be wrapped.
>
> Best,
> Boyang
>
> On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hey Boyang,
> >
> > I think maybe there's something cracking here :) I'm just asking for
> > clarifications that as of today, which non-fatal exceptions the newly
> > introduced CommitFailedException would cover, and it seems to be only 1)
> > unknown pid, 2) invalid pid mapping, and 3) concurrent transactions. Is
> > that correct?
> >
> > On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Hey Guozhang, I think TimeoutException would not be covered here as it
> > > potentially has a risk of hitting an illegal state on the broker side
> > when
> > > the previous commit was actually successful. Users should try to
> increase
> > > their max.block.ms to avoid hitting the timeout as a base suggestion,
> > > which
> > > is discussed in the KIP.
> > >
> > > As for the umbrella exception, I agree it has values to some extent,
> but
> > in
> > > terms of adoption difficulty, additional exception types usually make
> the
> > > error handling more complex than simplified, and we are doing our best
> to
> > > avoid compatibility issues caused by wrapping previously thrown raw
> > > exceptions that invalidates user error handling. And I'm also not in
> > favor
> > > of our exception hierarchy today where all exceptions are subclasses of
> > > KafkaException, and we already mixed its handling with other specific
> > > exception types that could be caught as superclass KafkaException. That
> > > could be addressed in a separate proposal if we see necessary.
> > >
> > > Boyang
> > >
> > > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Thanks Jason for the suggestion, that looks good to me too.
> > > >
> > > > Regarding the non-fatal exceptions wrapped as CommitFailed, I would
> > like
> > > to
> > > > clarify if we would cover all the following cases: 1) timeout, 2)
> > unknown
> > > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > > >
> > > > BTW I think it still makes sense to use an umbrella exception in case
> > in
> > > > the future we add more non-fatal cases even though today we only
> have a
> > > > few.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Jason, I agree with the proposed solution here, will update
> > the
> > > > KIP.
> > > > >
> > > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <
> jason@confluent.io
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Boyang,
> > > > > >
> > > > > > It seems like a reasonable suggestion. I wonder if a flag is
> > > sufficient
> > > > > > though. The current `Callback` documentation treats "fatal"
> errors
> > > from
> > > > > the
> > > > > > perspective of the individual message that was sent.
> > > > > >
> > > > > > ```
> > > > > >     *                  Non-Retriable exceptions (fatal, the
> message
> > > > will
> > > > > > never be sent):
> > > > > > ```
> > > > > >
> > > > > > However, we also have fatal errors from the perspective of the
> > > > > transaction
> > > > > > (e.g. when the producer gets fenced). Perhaps that suggests we
> need
> > > > > > something richer than a boolean flag:
> > > > > >
> > > > > > At a high level, I think the following cases are possible:
> > > > > >
> > > > > > - message rejected (e.g. message too large, invalid topic)
> > > > > > - delivery failed after retries/delivery timeout (e.g. timeout,
> crc
> > > > > error,
> > > > > > not enough replicas)
> > > > > > - transaction failed (e.g. producer fenced, invalid transaction
> > > state)
> > > > > >
> > > > > > Perhaps instead we can have a type like the following:
> > > > > >
> > > > > > class SendFailure {
> > > > > >   FailureType failureType;
> > > > > >   Exception cause;
> > > > > > }
> > > > > >
> > > > > > enum FailureType {
> > > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > > > }
> > > > > >
> > > > > > (Not married to any of these names, just a starting point.)
> > > > > >
> > > > > > Then we add a new `onCompletion` as you've suggested:
> > > > > >
> > > > > > default void onCompletion(RecordMetadata metadata, SendFailure
> > > > failure) {
> > > > > >   this.onCompletion(metadata, failure.cause());
> > > > > > }
> > > > > >
> > > > > > This would give streams and other applications enough information
> > to
> > > > know
> > > > > > whether the message can be retried and whether the transaction
> can
> > be
> > > > > > aborted.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > >
> > > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Jason for the thoughts.
> > > > > > >
> > > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > > jason@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Boyang,
> > > > > > > >
> > > > > > > > Thanks for the iterations here. I think this is something we
> > > should
> > > > > > have
> > > > > > > > done a long time ago. It sounds like there are two API
> changes
> > > > here:
> > > > > > > >
> > > > > > > > 1. We are introducing the `CommitFailedException` to wrap
> > > abortable
> > > > > > > > errors that are raised from `commitTransaction`. This sounds
> > fine
> > > > to
> > > > > > me.
> > > > > > > As
> > > > > > > > far as I know, the only case we might need this is when we
> add
> > > > > support
> > > > > > to
> > > > > > > > let producers recover from coordinator timeouts. Are there
> any
> > > > > others?
> > > > > > > >
> > > > > > > > I think the purpose here is to ensure non-fatal exceptions
> are
> > > > > unified
> > > > > > > under the same
> > > > > > > exception umbrella, to make the proceeding to abort any ongoing
> > > > > > transaction
> > > > > > > much easier.
> > > > > > > I don't think `coordinator timeouts` is the only case to
> recover
> > > > here,
> > > > > > > since we have other
> > > > > > > non-fatal exceptions such as UnknownPid.
> > > > > > >
> > > > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > > > `KafkaException`.
> > > > > > > > The motivation for this is less clear to me and it doesn't
> look
> > > > like
> > > > > > the
> > > > > > > > example from the KIP depends on it. My concern here is
> > > > compatibility.
> > > > > > > > Currently we have the following documentation for the
> > `Callback`
> > > > API:
> > > > > > > >
> > > > > > > > ```
> > > > > > > >      *                  Non-Retriable exceptions (fatal, the
> > > > message
> > > > > > will
> > > > > > > > never be sent):
> > > > > > > >      *
> > > > > > > >      *                  InvalidTopicException
> > > > > > > >      *                  OffsetMetadataTooLargeException
> > > > > > > >      *                  RecordBatchTooLargeException
> > > > > > > >      *                  RecordTooLargeException
> > > > > > > >      *                  UnknownServerException
> > > > > > > >      *                  UnknownProducerIdException
> > > > > > > >      *                  InvalidProducerEpochException
> > > > > > > >      *
> > > > > > > >      *                  Retriable exceptions (transient, may
> be
> > > > > covered
> > > > > > > by
> > > > > > > > increasing #.retries):
> > > > > > > >      *
> > > > > > > >      *                  CorruptRecordException
> > > > > > > >      *                  InvalidMetadataException
> > > > > > > >      *                  NotEnoughReplicasAfterAppendException
> > > > > > > >      *                  NotEnoughReplicasException
> > > > > > > >      *                  OffsetOutOfRangeException
> > > > > > > >      *                  TimeoutException
> > > > > > > >      *                  UnknownTopicOrPartitionException
> > > > > > > > ```
> > > > > > > >
> > > > > > > > If we wrap all the retriable exceptions documented here as
> > > > > > > > `KafkaException`, wouldn't that break any error handling that
> > > users
> > > > > > might
> > > > > > > > already have? it's gonna introduce a compatibility issue.
> > > > > > > >
> > > > > > > > The original intention was to simplify `send` callback error
> > > > handling
> > > > > > by
> > > > > > > doing exception wrapping, as on Streams level
> > > > > > > we have to prepare an exhausting list of exceptions to catch as
> > > > fatal,
> > > > > > and
> > > > > > > the same lengthy list to catch as
> > > > > > > non-fatal. It would be much easier if we got `hints` from the
> > > > callback.
> > > > > > > However,
> > > > > > > I agree there is a compatibility concern, what about
> deprecating
> > > the
> > > > > > > existing:
> > > > > > >
> > > > > > > void onCompletion(RecordMetadata metadata, Exception exception)
> > > > > > >
> > > > > > > and replace it with:
> > > > > > >
> > > > > > > default void onCompletion(RecordMetadata metadata, Exception
> > > > exception,
> > > > > > > boolean isFatal) {
> > > > > > >   this.onCompletion(metadata, exception);
> > > > > > > }
> > > > > > >
> > > > > > > to make sure new users get the benefit of understanding the
> > > fatality
> > > > > > based
> > > > > > > on the info presented by the producer?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <
> info@hiringuru.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Why  we are receiving all emails kindly remove us from
> > > > > > > > > dev@kafka.apache.org we don't want to receive emails
> > anymore.
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks Boyang, yes I think I was confused about the
> > different
> > > > > > > handling
> > > > > > > > of
> > > > > > > > > > two abortTxn calls, and now I get it was not
> intentional. I
> > > > > think I
> > > > > > > do
> > > > > > > > > not
> > > > > > > > > > have more concerns.
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Thanks for the clarification Guozhang, I got your point
> > > that
> > > > we
> > > > > > > want
> > > > > > > > to
> > > > > > > > > > > have a consistent handling of fatal exceptions being
> > thrown
> > > > > from
> > > > > > > the
> > > > > > > > > > > abortTxn. I modified the current template to move the
> > fatal
> > > > > > > exception
> > > > > > > > > > > try-catch outside of the processing loop to make sure
> we
> > > > could
> > > > > > get
> > > > > > > a
> > > > > > > > > chance
> > > > > > > > > > > to close consumer/producer modules. Let me know what
> you
> > > > think.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Boyang
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > My understanding is that abortTransaction would only
> > > throw
> > > > > when
> > > > > > > the
> > > > > > > > > > > > producer is in fatal state. For CommitFailed, the
> > > producer
> > > > > > should
> > > > > > > > > still
> > > > > > > > > > > be
> > > > > > > > > > > > in the abortable error state, so that
> abortTransaction
> > > call
> > > > > > would
> > > > > > > > not
> > > > > > > > > > > throw.
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > > > > > wangguoz@gmail.com
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Or are you going to maintain some internal state
> such
> > > > that,
> > > > > > the
> > > > > > > > > > > >> `abortTransaction` in the catch block would never
> > throw
> > > > > again?
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > I've also thought about this (i.e. using
> > CommitFailed
> > > > for
> > > > > > all
> > > > > > > > > > > >> non-fatal),
> > > > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > > > (CommitFailed)
> > > > > > > > block,
> > > > > > > > > > > what
> > > > > > > > > > > >> > would happen if the `producer.abortTransaction();`
> > > > throws
> > > > > > > again?
> > > > > > > > > > > should
> > > > > > > > > > > >> > that be captured as a fatal and cause the client
> to
> > > > close
> > > > > > > again.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > ...
> > > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > > >> >         // Transaction commit failed with
> abortable
> > > > error,
> > > > > > > user
> > > > > > > > > could
> > > > > > > > > > > >> reset
> > > > > > > > > > > >> >         // the application state and resume with a
> > new
> > > > > > > > > transaction.
> > > > > > > > > > > The
> > > > > > > > > > > >> > root
> > > > > > > > > > > >> >         // cause was wrapped in the thrown
> > exception.
> > > > > > > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > > > > > > >> >         try {
> > > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > > >> >             producer.close();
> > > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > > >> >             throw e;
> > > > > > > > > > > >> >         }
> > > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > > >> >         producer.close();
> > > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > > >> >         throw e;
> > > > > > > > > > > >> >     }
> > > > > > > > > > > >> > ...
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Guozhang
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > > >> > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Jason and I were discussing the new API offline
> and
> > > > > decided
> > > > > > > to
> > > > > > > > > take
> > > > > > > > > > > >> >> another
> > > > > > > > > > > >> >> approach. Firstly, the reason not to invent a new
> > API
> > > > > with
> > > > > > > > > returned
> > > > > > > > > > > >> >> boolean
> > > > > > > > > > > >> >> flag is for compatibility consideration, since
> old
> > > EOS
> > > > > code
> > > > > > > > > would not
> > > > > > > > > > > >> know
> > > > > > > > > > > >> >> that a given transaction commit was failed
> > internally
> > > > as
> > > > > > they
> > > > > > > > > don't
> > > > > > > > > > > >> listen
> > > > > > > > > > > >> >> to the output flag. Our proposed alternative
> > solution
> > > > is
> > > > > to
> > > > > > > let
> > > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > > >> >> throw CommitFailedException whenever the commit
> > > failed
> > > > > with
> > > > > > > > > non-fatal
> > > > > > > > > > > >> >> exception*, so that on the caller side the
> handling
> > > > logic
> > > > > > > > > becomes:
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> try {
> > > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > > > >> >>         } else {
> > > > > > > > > > > >> >>
> >  resetToLastCommittedPositions(consumer);
> > > > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > > > >> >>         }
> > > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > > >> >>         // Transaction commit failed with
> abortable
> > > > > error,
> > > > > > > user
> > > > > > > > > could
> > > > > > > > > > > >> >> reset
> > > > > > > > > > > >> >>         // the application state and resume with
> a
> > > new
> > > > > > > > > transaction.
> > > > > > > > > > > The
> > > > > > > > > > > >> >> root
> > > > > > > > > > > >> >>         // cause was wrapped in the thrown
> > exception.
> > > > > > > > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > > >> >>         throw e;
> > > > > > > > > > > >> >>     }
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> This approach looks cleaner as all exception
> types
> > > > other
> > > > > > than
> > > > > > > > > > > >> CommitFailed
> > > > > > > > > > > >> >> will doom to be fatal, which is very easy to
> adopt
> > > for
> > > > > > users.
> > > > > > > > In
> > > > > > > > > the
> > > > > > > > > > > >> >> meantime, we still maintain the commitTxn
> behavior
> > to
> > > > > throw
> > > > > > > > > instead
> > > > > > > > > > > of
> > > > > > > > > > > >> >> silently failing.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> In addition, we decided to drop the
> recommendation
> > to
> > > > > > handle
> > > > > > > > > > > >> >> TimeoutException and leave it to the users to
> make
> > > the
> > > > > > call.
> > > > > > > > The
> > > > > > > > > > > >> downside
> > > > > > > > > > > >> >> for blindly calling abortTxn upon timeout is that
> > we
> > > > > could
> > > > > > > > > result in
> > > > > > > > > > > an
> > > > > > > > > > > >> >> illegal state when the commit was already
> > successful
> > > on
> > > > > the
> > > > > > > > > broker
> > > > > > > > > > > >> >> side. Without a good guarantee on the outcome,
> > > > > > > overcomplicating
> > > > > > > > > the
> > > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Let me know your thoughts on the new approach
> here,
> > > > thank
> > > > > > > you!
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> Best,
> > > > > > > > > > > >> >> Boyang
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > >> >> wrote:
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >> > Thanks for your clarification on 2)/3), that
> > makes
> > > > > sense.
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied
> inline.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang
> Wang <
> > > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > > >> >> > wrote:
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it again
> > and
> > > > > have
> > > > > > > the
> > > > > > > > > > > >> following
> > > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn
> does
> > > not
> > > > > > throw
> > > > > > > > any
> > > > > > > > > > > >> >> non-fatal
> > > > > > > > > > > >> >> > > > exception, and instead we rely on the
> > > subsequent
> > > > > > > beginTxn
> > > > > > > > > call
> > > > > > > > > > > to
> > > > > > > > > > > >> >> > throw,
> > > > > > > > > > > >> >> > > it
> > > > > > > > > > > >> >> > > > may violate some callers with a pattern
> that
> > > > > relying
> > > > > > on
> > > > > > > > > > > >> commitTxn to
> > > > > > > > > > > >> >> > > > succeed to make some non-rollback
> operations.
> > > For
> > > > > > > > example:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit the
> DB
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > > > non-rollback
> > > > > > > > > operation,
> > > > > > > > > > > and
> > > > > > > > > > > >> >> users
> > > > > > > > > > > >> >> > > may
> > > > > > > > > > > >> >> > > > just rely on commitTxn to return without
> > error
> > > to
> > > > > > make
> > > > > > > > this
> > > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > > >> >> > > > call. Of course we can just claim this
> > pattern
> > > is
> > > > > not
> > > > > > > > > > > legitimate
> > > > > > > > > > > >> >> and is
> > > > > > > > > > > >> >> > > not
> > > > > > > > > > > >> >> > > > the right way of doing things, but many
> users
> > > may
> > > > > > > > naturally
> > > > > > > > > > > adopt
> > > > > > > > > > > >> >> this
> > > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > So maybe we should still let commitTxn also
> > > throw
> > > > > > > > non-fatal
> > > > > > > > > > > >> >> exceptions,
> > > > > > > > > > > >> >> > > in
> > > > > > > > > > > >> >> > > > which case we would then call abortTxn
> again.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > > >> >> > > >     } catch (...) {        // enumerate all
> > > fatal
> > > > > > > > > exceptions
> > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Thank makes me think, the alternative we
> have
> > > > > > discussed
> > > > > > > > > offline
> > > > > > > > > > > >> may
> > > > > > > > > > > >> >> be
> > > > > > > > > > > >> >> > > > better: let commitTxn() to return a boolean
> > > flag.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > * If it returns true, it means the commit
> > > > > succeeded.
> > > > > > > > Users
> > > > > > > > > can
> > > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > > >> >> > > > continue and do any external non-rollback
> > > > > operations
> > > > > > if
> > > > > > > > > they
> > > > > > > > > > > >> like.
> > > > > > > > > > > >> >> > > > * If it returns false, it means the commit
> > > failed
> > > > > > with
> > > > > > > > > > > non-fatal
> > > > > > > > > > > >> >> error
> > > > > > > > > > > >> >> > > *AND
> > > > > > > > > > > >> >> > > > the txn has been aborted*. Users do not
> need
> > to
> > > > > call
> > > > > > > > > abortTxn
> > > > > > > > > > > >> again.
> > > > > > > > > > > >> >> > > > * If it throws, then it means fatal errors.
> > > Users
> > > > > > > should
> > > > > > > > > shut
> > > > > > > > > > > >> down
> > > > > > > > > > > >> >> the
> > > > > > > > > > > >> >> > > > client.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > try {
> > > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > > >> >> > > >         // reset offsets, rollback
> > operations,
> > > > etc
> > > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > > >> >> > > >     }
> > > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > > >> >> > > > }
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > > > > > >> >> > > > else
> > > > > > > > > > > >> >> > > >    // reset offsets, rollback operations,
> etc
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Of course, if we want to have better
> > visibility
> > > > > into
> > > > > > > what
> > > > > > > > > > > caused
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> > > commit
> > > > > > > > > > > >> >> > > > to fail and txn to abort. We can let the
> > return
> > > > > type
> > > > > > be
> > > > > > > > an
> > > > > > > > > enum
> > > > > > > > > > > >> >> instead
> > > > > > > > > > > >> >> > > of
> > > > > > > > > > > >> >> > > > a flag. But my main idea is to still let
> the
> > > > > > commitTxn
> > > > > > > be
> > > > > > > > > the
> > > > > > > > > > > >> final
> > > > > > > > > > > >> >> > point
> > > > > > > > > > > >> >> > > > users can tell whether this txn succeeded
> or
> > > not,
> > > > > > > instead
> > > > > > > > > of
> > > > > > > > > > > >> >> relying on
> > > > > > > > > > > >> >> > > the
> > > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > I agree that adding a boolean flag is
> indeed
> > > > useful
> > > > > > in
> > > > > > > > this
> > > > > > > > > > > case.
> > > > > > > > > > > >> >> Will
> > > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to
> throw
> > > > fatal
> > > > > > > > > exception
> > > > > > > > > > > in
> > > > > > > > > > > >> >> raw"
> > > > > > > > > > > >> >> > not
> > > > > > > > > > > >> >> > > > sure what do you mean by "throw" here. Are
> > you
> > > > > > > proposing
> > > > > > > > > the
> > > > > > > > > > > >> >> callback
> > > > > > > > > > > >> >> > > would
> > > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions
> > when
> > > > > > > triggered
> > > > > > > > > > > without
> > > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is
> to
> > > make
> > > > > the
> > > > > > > end
> > > > > > > > > > > user's
> > > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > > 2. I'm not sure I understand the claim
> > > regarding
> > > > > the
> > > > > > > > > callback
> > > > > > > > > > > >> that
> > > > > > > > > > > >> >> "In
> > > > > > > > > > > >> >> > > EOS
> > > > > > > > > > > >> >> > > > setup, it is not required to handle the
> > > > exception".
> > > > > > Are
> > > > > > > > you
> > > > > > > > > > > >> >> proposing
> > > > > > > > > > > >> >> > > that,
> > > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to handle
> any
> > > > > > exceptions
> > > > > > > > if
> > > > > > > > > EOS
> > > > > > > > > > > is
> > > > > > > > > > > >> >> > enabled
> > > > > > > > > > > >> >> > > > in the callback anymore, but just let
> > > commitTxn()
> > > > > > > itself
> > > > > > > > to
> > > > > > > > > > > fail
> > > > > > > > > > > >> to
> > > > > > > > > > > >> >> be
> > > > > > > > > > > >> >> > > > notified about the problem? Personally I
> > think
> > > in
> > > > > > > Streams
> > > > > > > > > we
> > > > > > > > > > > >> should
> > > > > > > > > > > >> >> > just
> > > > > > > > > > > >> >> > > > make the handling logic of the callback to
> be
> > > > > > > consistent
> > > > > > > > > > > >> regardless
> > > > > > > > > > > >> >> of
> > > > > > > > > > > >> >> > > the
> > > > > > > > > > > >> >> > > > EOS settings (today we have different logic
> > > > > depending
> > > > > > > on
> > > > > > > > > this
> > > > > > > > > > > >> logic,
> > > > > > > > > > > >> >> > > which
> > > > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > My idea originates from the claim on send
> > API:
> > > > > > > > > > > >> >> > > "When used as part of a transaction, it is
> not
> > > > > > necessary
> > > > > > > to
> > > > > > > > > > > define
> > > > > > > > > > > >> a
> > > > > > > > > > > >> >> > > callback or check the result of the future
> in
> > > > order
> > > > > to
> > > > > > > > > detect
> > > > > > > > > > > >> errors
> > > > > > > > > > > >> >> > from
> > > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > > >> >> > > My understanding is that for EOS, the
> exception
> > > > will
> > > > > be
> > > > > > > > > detected
> > > > > > > > > > > by
> > > > > > > > > > > >> >> > calling
> > > > > > > > > > > >> >> > > transactional APIs either way, so it is a
> > > duplicate
> > > > > > > > handling
> > > > > > > > > to
> > > > > > > > > > > >> track
> > > > > > > > > > > >> >> > > all the sendExceptions in RecordCollector.
> > > > However, I
> > > > > > > > looked
> > > > > > > > > up
> > > > > > > > > > > >> >> > > sendException is being used today as follow:
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for
> any
> > > > > default
> > > > > > > or
> > > > > > > > > > > >> customized
> > > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > > >> >> > > 2. Stop collecting offset info or new
> > exceptions
> > > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in close(),
> > > flush()
> > > > > or
> > > > > > > new
> > > > > > > > > send()
> > > > > > > > > > > >> >> calls
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > To my understanding, we should still honor
> the
> > > > > > commitment
> > > > > > > > to
> > > > > > > > > #1
> > > > > > > > > > > for
> > > > > > > > > > > >> >> any
> > > > > > > > > > > >> >> > > user customized implementation. The #2 does
> not
> > > > > really
> > > > > > > > > affect our
> > > > > > > > > > > >> >> > decision
> > > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable as it
> > > could
> > > > > > help
> > > > > > > us
> > > > > > > > > fail
> > > > > > > > > > > >> fast
> > > > > > > > > > > >> >> in
> > > > > > > > > > > >> >> > > new send() instead of waiting to later stage
> of
> > > > > > > processing.
> > > > > > > > > In
> > > > > > > > > > > that
> > > > > > > > > > > >> >> > sense,
> > > > > > > > > > > >> >> > > I agree we should continue to record send
> > > > exceptions
> > > > > > even
> > > > > > > > > under
> > > > > > > > > > > EOS
> > > > > > > > > > > >> >> case
> > > > > > > > > > > >> >> > to
> > > > > > > > > > > >> >> > > ensure the strength of stream side Producer
> > > logic.
> > > > On
> > > > > > the
> > > > > > > > > safer
> > > > > > > > > > > >> side,
> > > > > > > > > > > >> >> we
> > > > > > > > > > > >> >> > no
> > > > > > > > > > > >> >> > > longer need to wrap certain fatal exceptions
> > like
> > > > > > > > > ProducerFenced
> > > > > > > > > > > as
> > > > > > > > > > > >> >> > > TaskMigrated, since they should not crash the
> > > > stream
> > > > > > > thread
> > > > > > > > > if
> > > > > > > > > > > >> thrown
> > > > > > > > > > > >> >> in
> > > > > > > > > > > >> >> > > raw format, once we adopt the new processing
> > > model
> > > > in
> > > > > > the
> > > > > > > > > send
> > > > > > > > > > > >> phase.
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang
> Chen <
> > > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so far. I
> > have
> > > > > > > polished
> > > > > > > > > the
> > > > > > > > > > > KIP
> > > > > > > > > > > >> >> after
> > > > > > > > > > > >> >> > > > > offline discussion with some folks
> working
> > on
> > > > EOS
> > > > > > to
> > > > > > > > > make the
> > > > > > > > > > > >> >> > exception
> > > > > > > > > > > >> >> > > > > handling more lightweight. The essential
> > > change
> > > > > is
> > > > > > > that
> > > > > > > > > we
> > > > > > > > > > > are
> > > > > > > > > > > >> not
> > > > > > > > > > > >> >> > > > > inventing a new intermediate exception
> > type,
> > > > but
> > > > > > > > instead
> > > > > > > > > > > >> >> separating a
> > > > > > > > > > > >> >> > > > full
> > > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > > > This way for any exception thrown from
> > phase
> > > 1,
> > > > > > will
> > > > > > > be
> > > > > > > > > an
> > > > > > > > > > > >> >> indicator
> > > > > > > > > > > >> >> > in
> > > > > > > > > > > >> >> > > > > phase 2 whether we should do commit or
> > abort,
> > > > and
> > > > > > > from
> > > > > > > > > now on
> > > > > > > > > > > >> >> > > > > `commitTransaction` should only throw
> fatal
> > > > > > > exceptions,
> > > > > > > > > > > >> similar to
> > > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > > KafkaException
> > > > > > caught
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > >> >> commit
> > > > > > > > > > > >> >> > > phase
> > > > > > > > > > > >> >> > > > > will be definitely fatal to crash the
> app.
> > > For
> > > > > more
> > > > > > > > > advanced
> > > > > > > > > > > >> users
> > > > > > > > > > > >> >> > such
> > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > >> >> > > > > Streams, we have the ability to further
> > wrap
> > > > > > selected
> > > > > > > > > types
> > > > > > > > > > > of
> > > > > > > > > > > >> >> fatal
> > > > > > > > > > > >> >> > > > > exceptions to trigger task migration if
> > > > > necessary.
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > > > More details in the KIP, feel free to
> take
> > > > > another
> > > > > > > > look,
> > > > > > > > > > > >> thanks!
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang
> > Chen <
> > > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno
> > > Cadonna
> > > > <
> > > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not know the
> > > > producer
> > > > > > > > > internal
> > > > > > > > > > > well
> > > > > > > > > > > >> >> > enough
> > > > > > > > > > > >> >> > > to
> > > > > > > > > > > >> >> > > > > >> comment on the categorization.
> However,
> > I
> > > > > think
> > > > > > > > > having a
> > > > > > > > > > > >> super
> > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that
> encodes
> > if
> > > > an
> > > > > > > > > exception is
> > > > > > > > > > > >> >> fatal
> > > > > > > > > > > >> >> > or
> > > > > > > > > > > >> >> > > > not
> > > > > > > > > > > >> >> > > > > >> is cleaner, better understandable and
> > less
> > > > > > > > > error-prone,
> > > > > > > > > > > >> because
> > > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > > >> >> > > > > >> when you add a new non-fatal exception
> > in
> > > > > future
> > > > > > > you
> > > > > > > > > just
> > > > > > > > > > > >> need
> > > > > > > > > > > >> >> to
> > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > >> >> > > > > >> about letting it inherit from the
> super
> > > > > > exception
> > > > > > > > and
> > > > > > > > > all
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> rest
> > > > > > > > > > > >> >> > > of
> > > > > > > > > > > >> >> > > > > >> the code will just behave correctly
> > > without
> > > > > the
> > > > > > > need
> > > > > > > > > to
> > > > > > > > > > > wrap
> > > > > > > > > > > >> >> the
> > > > > > > > > > > >> >> > new
> > > > > > > > > > > >> >> > > > > >> exception into another exception each
> > time
> > > > it
> > > > > is
> > > > > > > > > thrown
> > > > > > > > > > > >> (maybe
> > > > > > > > > > > >> >> it
> > > > > > > > > > > >> >> > is
> > > > > > > > > > > >> >> > > > > >> thrown at different location in the
> > code).
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> As far as I understand the following
> > > > statement
> > > > > > > from
> > > > > > > > > your
> > > > > > > > > > > >> >> previous
> > > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > > >> >> > > > > >> is the reason that currently such a
> > super
> > > > > > > exception
> > > > > > > > > is not
> > > > > > > > > > > >> >> > possible:
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> "Right now we have RetriableException
> > > type,
> > > > if
> > > > > > we
> > > > > > > > are
> > > > > > > > > > > going
> > > > > > > > > > > >> to
> > > > > > > > > > > >> >> > add a
> > > > > > > > > > > >> >> > > > > >> `ProducerRetriableException` type, we
> > have
> > > > to
> > > > > > put
> > > > > > > > > this new
> > > > > > > > > > > >> >> > interface
> > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > >> >> > > > > >> the parent of the RetriableException,
> > > > because
> > > > > > not
> > > > > > > > all
> > > > > > > > > > > thrown
> > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in general
> > for
> > > > > > > producer"
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> In the list of exceptions in your
> KIP, I
> > > > found
> > > > > > > > > non-fatal
> > > > > > > > > > > >> >> > exceptions
> > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > >> >> > > > > >> do not inherit from
> RetriableException.
> > I
> > > > > guess
> > > > > > > > those
> > > > > > > > > are
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> ones
> > > > > > > > > > > >> >> > > you
> > > > > > > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> All of those exceptions are non-fatal
> > and
> > > do
> > > > > not
> > > > > > > > > inherit
> > > > > > > > > > > >> from
> > > > > > > > > > > >> >> > > > > >> RetriableException. Is there a reason
> > for
> > > > > that?
> > > > > > If
> > > > > > > > > they
> > > > > > > > > > > >> >> depended
> > > > > > > > > > > >> >> > > from
> > > > > > > > > > > >> >> > > > > >> RetriableException we would be a bit
> > > closer
> > > > > to a
> > > > > > > > super
> > > > > > > > > > > >> >> exception I
> > > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> The reason is that sender may catch
> > those
> > > > > > > exceptions
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently does
> > > > infinite
> > > > > > > > > > > >> >> > > > > > retries on RetriableException. To make
> > sure
> > > > we
> > > > > > > could
> > > > > > > > > > > trigger
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> > > > > > abortTransaction() by catching
> non-fatal
> > > > thrown
> > > > > > > > > > > >> >> > > > > > exceptions and reinitialize the txn
> > state,
> > > we
> > > > > > chose
> > > > > > > > > not to
> > > > > > > > > > > >> let
> > > > > > > > > > > >> >> > those
> > > > > > > > > > > >> >> > > > > > exceptions inherit RetriableException
> so
> > > that
> > > > > > > > > > > >> >> > > > > > they won't cause infinite retry on
> > sender.
> > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > > >> >> > > > think
> > > > > > > > > > > >> >> > > > > >> to understand that their fatality
> > depends
> > > on
> > > > > the
> > > > > > > > type
> > > > > > > > > > > (i.e.
> > > > > > > > > > > >> >> > > > > >> configuration) of the producer. That
> > makes
> > > > it
> > > > > > > > > difficult to
> > > > > > > > > > > >> >> have a
> > > > > > > > > > > >> >> > > > super
> > > > > > > > > > > >> >> > > > > >> exception that encodes the
> retriability
> > as
> > > > > > > mentioned
> > > > > > > > > > > above.
> > > > > > > > > > > >> >> Would
> > > > > > > > > > > >> >> > it
> > > > > > > > > > > >> >> > > > be
> > > > > > > > > > > >> >> > > > > >> possible to introduce exceptions that
> > > > inherit
> > > > > > from
> > > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > > >> >> > > > > >> and that are thrown when those
> > exceptions
> > > > are
> > > > > > > caught
> > > > > > > > > from
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> > > brokers
> > > > > > > > > > > >> >> > > > > >> and the type of the producer is such
> > that
> > > > the
> > > > > > > > > exceptions
> > > > > > > > > > > are
> > > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> Yea, I think in general the exception
> > type
> > > > > > mixing
> > > > > > > is
> > > > > > > > > > > >> difficult
> > > > > > > > > > > >> >> to
> > > > > > > > > > > >> >> > > get
> > > > > > > > > > > >> >> > > > > > them all right. I have already proposed
> > > > another
> > > > > > > > > solution
> > > > > > > > > > > >> based
> > > > > > > > > > > >> >> on
> > > > > > > > > > > >> >> > my
> > > > > > > > > > > >> >> > > > > > offline discussion with some folks
> > working
> > > on
> > > > > EOS
> > > > > > > > > > > >> >> > > > > > to make the handling more
> straightforward
> > > for
> > > > > end
> > > > > > > > users
> > > > > > > > > > > >> without
> > > > > > > > > > > >> >> the
> > > > > > > > > > > >> >> > > > need
> > > > > > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang
> wrote:
> > > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I
> > made a
> > > > > pass
> > > > > > > over
> > > > > > > > > the
> > > > > > > > > > > >> list
> > > > > > > > > > > >> >> and
> > > > > > > > > > > >> >> > > here
> > > > > > > > > > > >> >> > > > > are
> > > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > 0) Although this is not part of the
> > > public
> > > > > > API,
> > > > > > > I
> > > > > > > > > think
> > > > > > > > > > > we
> > > > > > > > > > > >> >> > should
> > > > > > > > > > > >> >> > > > make
> > > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e.
> user
> > > can
> > > > > > always
> > > > > > > > > call
> > > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are
> indeed
> > > > > > supported.
> > > > > > > > > E.g. if
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> txn
> > > > > > > > > > > >> >> > > is
> > > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > > >> >> > > > > >> > aborted by the broker side, then
> users
> > > can
> > > > > > still
> > > > > > > > > call
> > > > > > > > > > > >> >> abortTxn
> > > > > > > > > > > >> >> > > which
> > > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > > >> >> > > > > >> > not throw another exception but just
> > be
> > > > > > treated
> > > > > > > > as a
> > > > > > > > > > > >> no-op.
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > 1)
> *ConcurrentTransactionsException*:
> > I
> > > > > think
> > > > > > > this
> > > > > > > > > error
> > > > > > > > > > > >> can
> > > > > > > > > > > >> >> > also
> > > > > > > > > > > >> >> > > be
> > > > > > > > > > > >> >> > > > > >> > returned but not documented yet.
> This
> > > > should
> > > > > > be
> > > > > > > a
> > > > > > > > > > > >> non-fatal
> > > > > > > > > > > >> >> > error.
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this
> > > error
> > > > is
> > > > > > > > > returned
> > > > > > > > > > > from
> > > > > > > > > > > >> >> > broker
> > > > > > > > > > > >> >> > > > when
> > > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > > >> >> > > > > >> > state transition failed (e.g. it is
> > > trying
> > > > > to
> > > > > > > > > transit to
> > > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > > > prepare-commit).
> > > > > > > > This
> > > > > > > > > > > error
> > > > > > > > > > > >> >> could
> > > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > > >> >> > > > > >> > a bug on the client internal code or
> > the
> > > > > > broker
> > > > > > > > > code,
> > > > > > > > > > > OR a
> > > > > > > > > > > >> >> user
> > > > > > > > > > > >> >> > > > error
> > > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > > ConcurrentTransactionsException,
> > > > > > > > > i.e.
> > > > > > > > > > > if
> > > > > > > > > > > >> >> Kafka
> > > > > > > > > > > >> >> > is
> > > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > > >> >> > > > > >> > these exceptions would only be
> > returned
> > > if
> > > > > > users
> > > > > > > > > try to
> > > > > > > > > > > do
> > > > > > > > > > > >> >> > > something
> > > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a
> > > > > commitTxn,
> > > > > > > > etc.
> > > > > > > > > So
> > > > > > > > > > > I'm
> > > > > > > > > > > >> >> > > thinking
> > > > > > > > > > > >> >> > > > it
> > > > > > > > > > > >> >> > > > > >> > should be a non-fatal error instead
> > of a
> > > > > fatal
> > > > > > > > > error,
> > > > > > > > > > > >> wdyt?
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i
> "indicates
> > > > fatal
> > > > > > > > > > > transactional
> > > > > > > > > > > >> >> > > sequence
> > > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting
> > with
> > > > the
> > > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess
> your
> > > > > proposal
> > > > > > > is
> > > > > > > > > that
> > > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be
> > > > treated
> > > > > > > > either
> > > > > > > > > as
> > > > > > > > > > > >> fatal
> > > > > > > > > > > >> >> > with
> > > > > > > > > > > >> >> > > > > >> > transactional producer, or non-fatal
> > > with
> > > > > > > > idempotent
> > > > > > > > > > > >> >> producer,
> > > > > > > > > > > >> >> > is
> > > > > > > > > > > >> >> > > > that
> > > > > > > > > > > >> >> > > > > >> > right? If the producer is only
> > > configured
> > > > > with
> > > > > > > > > > > idempotency
> > > > > > > > > > > >> >> but
> > > > > > > > > > > >> >> > not
> > > > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > > >> >> > > for
> > > > > > > > > > > >> >> > > > > >> > non-fatal errors would be confusing
> > > since
> > > > > > users
> > > > > > > > are
> > > > > > > > > not
> > > > > > > > > > > >> using
> > > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we always
> throw
> > > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > > >> >> > as-is
> > > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the
> > producer
> > > is
> > > > > > > > > configured,
> > > > > > > > > > > and
> > > > > > > > > > > >> >> let
> > > > > > > > > > > >> >> > the
> > > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > > >> >> > > > > >> > decide how to handle it based on
> > whether
> > > > it
> > > > > is
> > > > > > > > only
> > > > > > > > > > > >> >> idempotent
> > > > > > > > > > > >> >> > or
> > > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the
> > > `send()`
> > > > > > > > callback /
> > > > > > > > > > > >> future
> > > > > > > > > > > >> >> can
> > > > > > > > > > > >> >> > > also
> > > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I think this
> > KIP
> > > > > > should
> > > > > > > > also
> > > > > > > > > > > cover
> > > > > > > > > > > >> >> this
> > > > > > > > > > > >> >> > > API
> > > > > > > > > > > >> >> > > > as
> > > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above:
> > > > sometimes
> > > > > > > those
> > > > > > > > > > > >> non-fatal
> > > > > > > > > > > >> >> > errors
> > > > > > > > > > > >> >> > > > > like
> > > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are
> > not
> > > > due
> > > > > > to
> > > > > > > > the
> > > > > > > > > > > state
> > > > > > > > > > > >> >> being
> > > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe users
> > are
> > > > > doing
> > > > > > > > > something
> > > > > > > > > > > >> >> wrong.
> > > > > > > > > > > >> >> > So
> > > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > > >> >> > > > > >> > wondering if we should further
> > > distinguish
> > > > > > those
> > > > > > > > > > > non-fatal
> > > > > > > > > > > >> >> > errors
> > > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > > >> >> > > > > >> > a) those that are caused by Kafka
> > > itself,
> > > > > > e.g. a
> > > > > > > > > broker
> > > > > > > > > > > >> timed
> > > > > > > > > > > >> >> > out
> > > > > > > > > > > >> >> > > > and
> > > > > > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn
> > > request
> > > > is
> > > > > > > > > received,
> > > > > > > > > > > >> and b)
> > > > > > > > > > > >> >> > the
> > > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect,
> causing
> > > the
> > > > > > > request
> > > > > > > > > to be
> > > > > > > > > > > >> >> > rejected
> > > > > > > > > > > >> >> > > > with
> > > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > > >> >> > > > feels
> > > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > > >> >> > > > > >> > me more like for case a), but not
> case
> > > b).
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM
> Boyang
> > > > Chen <
> > > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your
> > proposal
> > > > > makes
> > > > > > > > sense
> > > > > > > > > as
> > > > > > > > > > > >> well,
> > > > > > > > > > > >> >> on
> > > > > > > > > > > >> >> > > the
> > > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > > >> >> > > > > >> >> we could have a universally agreed
> > > > > exception
> > > > > > > type
> > > > > > > > > to be
> > > > > > > > > > > >> >> caught
> > > > > > > > > > > >> >> > by
> > > > > > > > > > > >> >> > > > the
> > > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > > >> >> > > > > >> >> without having an extra layer on
> top
> > of
> > > > the
> > > > > > > > actual
> > > > > > > > > > > >> >> exceptions.
> > > > > > > > > > > >> >> > I
> > > > > > > > > > > >> >> > > > > could
> > > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be
> > more
> > > > > > > complex.
> > > > > > > > > Right
> > > > > > > > > > > >> now
> > > > > > > > > > > >> >> we
> > > > > > > > > > > >> >> > > have
> > > > > > > > > > > >> >> > > > > >> >> RetriableException type, if we are
> > > going
> > > > to
> > > > > > > add a
> > > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException` type,
> we
> > > > have
> > > > > to
> > > > > > > put
> > > > > > > > > this
> > > > > > > > > > > >> new
> > > > > > > > > > > >> >> > > > interface
> > > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > > >> >> > > > > >> >> parent of the RetriableException,
> > > because
> > > > > not
> > > > > > > all
> > > > > > > > > > > thrown
> > > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in
> general
> > > for
> > > > > > > > > producer, for
> > > > > > > > > > > >> >> example
> > > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >>
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > > >> >> > to
> > > > > > > > > > > >> >> > > > let
> > > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > > >> >> > > > > >> >> the thrown exceptions implement for
> > > sure,
> > > > > but
> > > > > > > > it's
> > > > > > > > > a
> > > > > > > > > > > bit
> > > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > > >> >> > > > > in
> > > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions in
> > general.
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > > > > > > > KafkaException
> > > > > > > > > > > >> wrapping
> > > > > > > > > > > >> >> > > another
> > > > > > > > > > > >> >> > > > > >> >> KafkaException as either fatal or
> > > > > non-fatal.
> > > > > > If
> > > > > > > > we
> > > > > > > > > use
> > > > > > > > > > > an
> > > > > > > > > > > >> >> > > interface
> > > > > > > > > > > >> >> > > > > to
> > > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required to
> > > > implement
> > > > > > > > another
> > > > > > > > > > > >> bloated
> > > > > > > > > > > >> >> > > > exception
> > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > >> >> > > > > >> >> which could replace KafkaException
> > > type,
> > > > as
> > > > > > we
> > > > > > > > > couldn't
> > > > > > > > > > > >> mark
> > > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation,
> > > > wrapping
> > > > > > > means
> > > > > > > > we
> > > > > > > > > > > could
> > > > > > > > > > > >> >> limit
> > > > > > > > > > > >> >> > > the
> > > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > > >> >> > > > > >> >> of affection to the producer only,
> > > which
> > > > is
> > > > > > > > > important
> > > > > > > > > > > >> since
> > > > > > > > > > > >> >> we
> > > > > > > > > > > >> >> > > > don't
> > > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > > >> >> > > > > >> >> shared exception types to
> implement a
> > > > > > > > > producer-related
> > > > > > > > > > > >> >> > interface,
> > > > > > > > > > > >> >> > > > > such
> > > > > > > > > > > >> >> > > > > >> >> as
> UnknownTopicOrPartitionException.
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM
> > Matthias
> > > > J.
> > > > > > Sax
> > > > > > > <
> > > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions
> > > makes a
> > > > > lot
> > > > > > > of
> > > > > > > > > sense.
> > > > > > > > > > > >> As I
> > > > > > > > > > > >> >> > > don't
> > > > > > > > > > > >> >> > > > > know
> > > > > > > > > > > >> >> > > > > >> >>> the producer internals well
> enough,
> > I
> > > > > cannot
> > > > > > > > > comment
> > > > > > > > > > > on
> > > > > > > > > > > >> the
> > > > > > > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we
> should
> > > > > > introduce
> > > > > > > > an
> > > > > > > > > > > >> exception
> > > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception would
> > > implement
> > > > > > > instead
> > > > > > > > > of
> > > > > > > > > > > >> >> creating a
> > > > > > > > > > > >> >> > > new
> > > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal
> exceptions?
> > > > What
> > > > > > > would
> > > > > > > > > be the
> > > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > > >> >> > > > for
> > > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen
> > > wrote:
> > > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > >> >> > > > > >> >>>> I would like to start a
> discussion
> > > > thread
> > > > > > for
> > > > > > > > > > > KIP-691:
> > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >>
> > > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the
> > > > > exception
> > > > > > > > > handling
> > > > > > > > > > > >> logic
> > > > > > > > > > > >> >> > for
> > > > > > > > > > > >> >> > > > > >> >>>> transactional Producer users by
> > > > > classifying
> > > > > > > > > fatal and
> > > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > > >> >> > > > > >> >>>> and throw them correspondingly
> for
> > > > easier
> > > > > > > catch
> > > > > > > > > and
> > > > > > > > > > > >> retry.
> > > > > > > > > > > >> >> > Let
> > > > > > > > > > > >> >> > > me
> > > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > > >> >> > > > > >
> > > > > > > > > > > >> >> > > > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > > > --
> > > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > > >> >> > > >
> > > > > > > > > > > >> >> > >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >> > --
> > > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > > >> >> >
> > > > > > > > > > > >> >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > --
> > > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> --
> > > > > > > > > > > >> -- Guozhang
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Hey Guozhang,

the CommitFailedException wrapping would cover all the non-fatal exceptions
as we listed out in the KIP, generally speaking any exception that could
recover safely by calling abortTxn should be wrapped.

Best,
Boyang

On Thu, Jan 28, 2021 at 5:22 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hey Boyang,
>
> I think maybe there's something cracking here :) I'm just asking for
> clarifications that as of today, which non-fatal exceptions the newly
> introduced CommitFailedException would cover, and it seems to be only 1)
> unknown pid, 2) invalid pid mapping, and 3) concurrent transactions. Is
> that correct?
>
> On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Hey Guozhang, I think TimeoutException would not be covered here as it
> > potentially has a risk of hitting an illegal state on the broker side
> when
> > the previous commit was actually successful. Users should try to increase
> > their max.block.ms to avoid hitting the timeout as a base suggestion,
> > which
> > is discussed in the KIP.
> >
> > As for the umbrella exception, I agree it has values to some extent, but
> in
> > terms of adoption difficulty, additional exception types usually make the
> > error handling more complex than simplified, and we are doing our best to
> > avoid compatibility issues caused by wrapping previously thrown raw
> > exceptions that invalidates user error handling. And I'm also not in
> favor
> > of our exception hierarchy today where all exceptions are subclasses of
> > KafkaException, and we already mixed its handling with other specific
> > exception types that could be caught as superclass KafkaException. That
> > could be addressed in a separate proposal if we see necessary.
> >
> > Boyang
> >
> > On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Thanks Jason for the suggestion, that looks good to me too.
> > >
> > > Regarding the non-fatal exceptions wrapped as CommitFailed, I would
> like
> > to
> > > clarify if we would cover all the following cases: 1) timeout, 2)
> unknown
> > > pid, 3) invalid pid mapping, 4) concurrent transactions?
> > >
> > > BTW I think it still makes sense to use an umbrella exception in case
> in
> > > the future we add more non-fatal cases even though today we only have a
> > > few.
> > >
> > >
> > > Guozhang
> > >
> > > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Thanks Jason, I agree with the proposed solution here, will update
> the
> > > KIP.
> > > >
> > > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > >
> > > > > Hi Boyang,
> > > > >
> > > > > It seems like a reasonable suggestion. I wonder if a flag is
> > sufficient
> > > > > though. The current `Callback` documentation treats "fatal" errors
> > from
> > > > the
> > > > > perspective of the individual message that was sent.
> > > > >
> > > > > ```
> > > > >     *                  Non-Retriable exceptions (fatal, the message
> > > will
> > > > > never be sent):
> > > > > ```
> > > > >
> > > > > However, we also have fatal errors from the perspective of the
> > > > transaction
> > > > > (e.g. when the producer gets fenced). Perhaps that suggests we need
> > > > > something richer than a boolean flag:
> > > > >
> > > > > At a high level, I think the following cases are possible:
> > > > >
> > > > > - message rejected (e.g. message too large, invalid topic)
> > > > > - delivery failed after retries/delivery timeout (e.g. timeout, crc
> > > > error,
> > > > > not enough replicas)
> > > > > - transaction failed (e.g. producer fenced, invalid transaction
> > state)
> > > > >
> > > > > Perhaps instead we can have a type like the following:
> > > > >
> > > > > class SendFailure {
> > > > >   FailureType failureType;
> > > > >   Exception cause;
> > > > > }
> > > > >
> > > > > enum FailureType {
> > > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > > }
> > > > >
> > > > > (Not married to any of these names, just a starting point.)
> > > > >
> > > > > Then we add a new `onCompletion` as you've suggested:
> > > > >
> > > > > default void onCompletion(RecordMetadata metadata, SendFailure
> > > failure) {
> > > > >   this.onCompletion(metadata, failure.cause());
> > > > > }
> > > > >
> > > > > This would give streams and other applications enough information
> to
> > > know
> > > > > whether the message can be retried and whether the transaction can
> be
> > > > > aborted.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > -Jason
> > > > >
> > > > >
> > > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Jason for the thoughts.
> > > > > >
> > > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> > jason@confluent.io
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Boyang,
> > > > > > >
> > > > > > > Thanks for the iterations here. I think this is something we
> > should
> > > > > have
> > > > > > > done a long time ago. It sounds like there are two API changes
> > > here:
> > > > > > >
> > > > > > > 1. We are introducing the `CommitFailedException` to wrap
> > abortable
> > > > > > > errors that are raised from `commitTransaction`. This sounds
> fine
> > > to
> > > > > me.
> > > > > > As
> > > > > > > far as I know, the only case we might need this is when we add
> > > > support
> > > > > to
> > > > > > > let producers recover from coordinator timeouts. Are there any
> > > > others?
> > > > > > >
> > > > > > > I think the purpose here is to ensure non-fatal exceptions are
> > > > unified
> > > > > > under the same
> > > > > > exception umbrella, to make the proceeding to abort any ongoing
> > > > > transaction
> > > > > > much easier.
> > > > > > I don't think `coordinator timeouts` is the only case to recover
> > > here,
> > > > > > since we have other
> > > > > > non-fatal exceptions such as UnknownPid.
> > > > > >
> > > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > > `KafkaException`.
> > > > > > > The motivation for this is less clear to me and it doesn't look
> > > like
> > > > > the
> > > > > > > example from the KIP depends on it. My concern here is
> > > compatibility.
> > > > > > > Currently we have the following documentation for the
> `Callback`
> > > API:
> > > > > > >
> > > > > > > ```
> > > > > > >      *                  Non-Retriable exceptions (fatal, the
> > > message
> > > > > will
> > > > > > > never be sent):
> > > > > > >      *
> > > > > > >      *                  InvalidTopicException
> > > > > > >      *                  OffsetMetadataTooLargeException
> > > > > > >      *                  RecordBatchTooLargeException
> > > > > > >      *                  RecordTooLargeException
> > > > > > >      *                  UnknownServerException
> > > > > > >      *                  UnknownProducerIdException
> > > > > > >      *                  InvalidProducerEpochException
> > > > > > >      *
> > > > > > >      *                  Retriable exceptions (transient, may be
> > > > covered
> > > > > > by
> > > > > > > increasing #.retries):
> > > > > > >      *
> > > > > > >      *                  CorruptRecordException
> > > > > > >      *                  InvalidMetadataException
> > > > > > >      *                  NotEnoughReplicasAfterAppendException
> > > > > > >      *                  NotEnoughReplicasException
> > > > > > >      *                  OffsetOutOfRangeException
> > > > > > >      *                  TimeoutException
> > > > > > >      *                  UnknownTopicOrPartitionException
> > > > > > > ```
> > > > > > >
> > > > > > > If we wrap all the retriable exceptions documented here as
> > > > > > > `KafkaException`, wouldn't that break any error handling that
> > users
> > > > > might
> > > > > > > already have? it's gonna introduce a compatibility issue.
> > > > > > >
> > > > > > > The original intention was to simplify `send` callback error
> > > handling
> > > > > by
> > > > > > doing exception wrapping, as on Streams level
> > > > > > we have to prepare an exhausting list of exceptions to catch as
> > > fatal,
> > > > > and
> > > > > > the same lengthy list to catch as
> > > > > > non-fatal. It would be much easier if we got `hints` from the
> > > callback.
> > > > > > However,
> > > > > > I agree there is a compatibility concern, what about deprecating
> > the
> > > > > > existing:
> > > > > >
> > > > > > void onCompletion(RecordMetadata metadata, Exception exception)
> > > > > >
> > > > > > and replace it with:
> > > > > >
> > > > > > default void onCompletion(RecordMetadata metadata, Exception
> > > exception,
> > > > > > boolean isFatal) {
> > > > > >   this.onCompletion(metadata, exception);
> > > > > > }
> > > > > >
> > > > > > to make sure new users get the benefit of understanding the
> > fatality
> > > > > based
> > > > > > on the info presented by the producer?
> > > > > >
> > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com>
> > > > wrote:
> > > > > > >
> > > > > > > > Why  we are receiving all emails kindly remove us from
> > > > > > > > dev@kafka.apache.org we don't want to receive emails
> anymore.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks Boyang, yes I think I was confused about the
> different
> > > > > > handling
> > > > > > > of
> > > > > > > > > two abortTxn calls, and now I get it was not intentional. I
> > > > think I
> > > > > > do
> > > > > > > > not
> > > > > > > > > have more concerns.
> > > > > > > > >
> > > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the clarification Guozhang, I got your point
> > that
> > > we
> > > > > > want
> > > > > > > to
> > > > > > > > > > have a consistent handling of fatal exceptions being
> thrown
> > > > from
> > > > > > the
> > > > > > > > > > abortTxn. I modified the current template to move the
> fatal
> > > > > > exception
> > > > > > > > > > try-catch outside of the processing loop to make sure we
> > > could
> > > > > get
> > > > > > a
> > > > > > > > chance
> > > > > > > > > > to close consumer/producer modules. Let me know what you
> > > think.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Boyang
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > My understanding is that abortTransaction would only
> > throw
> > > > when
> > > > > > the
> > > > > > > > > > > producer is in fatal state. For CommitFailed, the
> > producer
> > > > > should
> > > > > > > > still
> > > > > > > > > > be
> > > > > > > > > > > in the abortable error state, so that abortTransaction
> > call
> > > > > would
> > > > > > > not
> > > > > > > > > > throw.
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > > > > wangguoz@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Or are you going to maintain some internal state such
> > > that,
> > > > > the
> > > > > > > > > > >> `abortTransaction` in the catch block would never
> throw
> > > > again?
> > > > > > > > > > >>
> > > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > > > > wangguoz@gmail.com>
> > > > > > > > > > >> wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > > >> >
> > > > > > > > > > >> > I've also thought about this (i.e. using
> CommitFailed
> > > for
> > > > > all
> > > > > > > > > > >> non-fatal),
> > > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > > (CommitFailed)
> > > > > > > block,
> > > > > > > > > > what
> > > > > > > > > > >> > would happen if the `producer.abortTransaction();`
> > > throws
> > > > > > again?
> > > > > > > > > > should
> > > > > > > > > > >> > that be captured as a fatal and cause the client to
> > > close
> > > > > > again.
> > > > > > > > > > >> >
> > > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > > >> >
> > > > > > > > > > >> > ...
> > > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > > >> >         // Transaction commit failed with abortable
> > > error,
> > > > > > user
> > > > > > > > could
> > > > > > > > > > >> reset
> > > > > > > > > > >> >         // the application state and resume with a
> new
> > > > > > > > transaction.
> > > > > > > > > > The
> > > > > > > > > > >> > root
> > > > > > > > > > >> >         // cause was wrapped in the thrown
> exception.
> > > > > > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > > > > > >> >         try {
> > > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > > >> >             producer.close();
> > > > > > > > > > >> >             consumer.close();
> > > > > > > > > > >> >             throw e;
> > > > > > > > > > >> >         }
> > > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > > >> >         producer.close();
> > > > > > > > > > >> >         consumer.close();
> > > > > > > > > > >> >         throw e;
> > > > > > > > > > >> >     }
> > > > > > > > > > >> > ...
> > > > > > > > > > >> >
> > > > > > > > > > >> > Guozhang
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > > >> > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Jason and I were discussing the new API offline and
> > > > decided
> > > > > > to
> > > > > > > > take
> > > > > > > > > > >> >> another
> > > > > > > > > > >> >> approach. Firstly, the reason not to invent a new
> API
> > > > with
> > > > > > > > returned
> > > > > > > > > > >> >> boolean
> > > > > > > > > > >> >> flag is for compatibility consideration, since old
> > EOS
> > > > code
> > > > > > > > would not
> > > > > > > > > > >> know
> > > > > > > > > > >> >> that a given transaction commit was failed
> internally
> > > as
> > > > > they
> > > > > > > > don't
> > > > > > > > > > >> listen
> > > > > > > > > > >> >> to the output flag. Our proposed alternative
> solution
> > > is
> > > > to
> > > > > > let
> > > > > > > > > > >> >> *commitTransaction
> > > > > > > > > > >> >> throw CommitFailedException whenever the commit
> > failed
> > > > with
> > > > > > > > non-fatal
> > > > > > > > > > >> >> exception*, so that on the caller side the handling
> > > logic
> > > > > > > > becomes:
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> try {
> > > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > > >> >>         } else {
> > > > > > > > > > >> >>
>  resetToLastCommittedPositions(consumer);
> > > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > > >> >>         }
> > > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > > >> >>         // Transaction commit failed with abortable
> > > > error,
> > > > > > user
> > > > > > > > could
> > > > > > > > > > >> >> reset
> > > > > > > > > > >> >>         // the application state and resume with a
> > new
> > > > > > > > transaction.
> > > > > > > > > > The
> > > > > > > > > > >> >> root
> > > > > > > > > > >> >>         // cause was wrapped in the thrown
> exception.
> > > > > > > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > > >> >>         producer.close();
> > > > > > > > > > >> >>         consumer.close();
> > > > > > > > > > >> >>         throw e;
> > > > > > > > > > >> >>     }
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> This approach looks cleaner as all exception types
> > > other
> > > > > than
> > > > > > > > > > >> CommitFailed
> > > > > > > > > > >> >> will doom to be fatal, which is very easy to adopt
> > for
> > > > > users.
> > > > > > > In
> > > > > > > > the
> > > > > > > > > > >> >> meantime, we still maintain the commitTxn behavior
> to
> > > > throw
> > > > > > > > instead
> > > > > > > > > > of
> > > > > > > > > > >> >> silently failing.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> In addition, we decided to drop the recommendation
> to
> > > > > handle
> > > > > > > > > > >> >> TimeoutException and leave it to the users to make
> > the
> > > > > call.
> > > > > > > The
> > > > > > > > > > >> downside
> > > > > > > > > > >> >> for blindly calling abortTxn upon timeout is that
> we
> > > > could
> > > > > > > > result in
> > > > > > > > > > an
> > > > > > > > > > >> >> illegal state when the commit was already
> successful
> > on
> > > > the
> > > > > > > > broker
> > > > > > > > > > >> >> side. Without a good guarantee on the outcome,
> > > > > > overcomplicating
> > > > > > > > the
> > > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Let me know your thoughts on the new approach here,
> > > thank
> > > > > > you!
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> Best,
> > > > > > > > > > >> >> Boyang
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > >> >> wrote:
> > > > > > > > > > >> >>
> > > > > > > > > > >> >> > Thanks for your clarification on 2)/3), that
> makes
> > > > sense.
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > > >> >> > wrote:
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > > >> >> > wrote:
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it again
> and
> > > > have
> > > > > > the
> > > > > > > > > > >> following
> > > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does
> > not
> > > > > throw
> > > > > > > any
> > > > > > > > > > >> >> non-fatal
> > > > > > > > > > >> >> > > > exception, and instead we rely on the
> > subsequent
> > > > > > beginTxn
> > > > > > > > call
> > > > > > > > > > to
> > > > > > > > > > >> >> > throw,
> > > > > > > > > > >> >> > > it
> > > > > > > > > > >> >> > > > may violate some callers with a pattern that
> > > > relying
> > > > > on
> > > > > > > > > > >> commitTxn to
> > > > > > > > > > >> >> > > > succeed to make some non-rollback operations.
> > For
> > > > > > > example:
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > -------------
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > > non-rollback
> > > > > > > > operation,
> > > > > > > > > > and
> > > > > > > > > > >> >> users
> > > > > > > > > > >> >> > > may
> > > > > > > > > > >> >> > > > just rely on commitTxn to return without
> error
> > to
> > > > > make
> > > > > > > this
> > > > > > > > > > >> >> > non-rollback
> > > > > > > > > > >> >> > > > call. Of course we can just claim this
> pattern
> > is
> > > > not
> > > > > > > > > > legitimate
> > > > > > > > > > >> >> and is
> > > > > > > > > > >> >> > > not
> > > > > > > > > > >> >> > > > the right way of doing things, but many users
> > may
> > > > > > > naturally
> > > > > > > > > > adopt
> > > > > > > > > > >> >> this
> > > > > > > > > > >> >> > > > pattern.
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > So maybe we should still let commitTxn also
> > throw
> > > > > > > non-fatal
> > > > > > > > > > >> >> exceptions,
> > > > > > > > > > >> >> > > in
> > > > > > > > > > >> >> > > > which case we would then call abortTxn again.
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > try {
> > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > >> >> > > > }
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > > >> >> > > >     try {
> > > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > > >> >> > > >     } catch (...) {        // enumerate all
> > fatal
> > > > > > > > exceptions
> > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > > >> >> > > >     }
> > > > > > > > > > >> >> > > > }
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > > >> >> > > > try {
> > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > > >> >> > > >     }
> > > > > > > > > > >> >> > > > }
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > Thank makes me think, the alternative we have
> > > > > discussed
> > > > > > > > offline
> > > > > > > > > > >> may
> > > > > > > > > > >> >> be
> > > > > > > > > > >> >> > > > better: let commitTxn() to return a boolean
> > flag.
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > * If it returns true, it means the commit
> > > > succeeded.
> > > > > > > Users
> > > > > > > > can
> > > > > > > > > > >> >> > > comfortably
> > > > > > > > > > >> >> > > > continue and do any external non-rollback
> > > > operations
> > > > > if
> > > > > > > > they
> > > > > > > > > > >> like.
> > > > > > > > > > >> >> > > > * If it returns false, it means the commit
> > failed
> > > > > with
> > > > > > > > > > non-fatal
> > > > > > > > > > >> >> error
> > > > > > > > > > >> >> > > *AND
> > > > > > > > > > >> >> > > > the txn has been aborted*. Users do not need
> to
> > > > call
> > > > > > > > abortTxn
> > > > > > > > > > >> again.
> > > > > > > > > > >> >> > > > * If it throws, then it means fatal errors.
> > Users
> > > > > > should
> > > > > > > > shut
> > > > > > > > > > >> down
> > > > > > > > > > >> >> the
> > > > > > > > > > >> >> > > > client.
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > try {
> > > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > > >> >> > > >    // do something
> > > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > > >> >> > > > }
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > try {
> > > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > > >> >> > > >     } else {
> > > > > > > > > > >> >> > > >         // reset offsets, rollback
> operations,
> > > etc
> > > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > > >> >> > > >     }
> > > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > > >> >> > > > }
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > > > > >> >> > > > else
> > > > > > > > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > Of course, if we want to have better
> visibility
> > > > into
> > > > > > what
> > > > > > > > > > caused
> > > > > > > > > > >> the
> > > > > > > > > > >> >> > > commit
> > > > > > > > > > >> >> > > > to fail and txn to abort. We can let the
> return
> > > > type
> > > > > be
> > > > > > > an
> > > > > > > > enum
> > > > > > > > > > >> >> instead
> > > > > > > > > > >> >> > > of
> > > > > > > > > > >> >> > > > a flag. But my main idea is to still let the
> > > > > commitTxn
> > > > > > be
> > > > > > > > the
> > > > > > > > > > >> final
> > > > > > > > > > >> >> > point
> > > > > > > > > > >> >> > > > users can tell whether this txn succeeded or
> > not,
> > > > > > instead
> > > > > > > > of
> > > > > > > > > > >> >> relying on
> > > > > > > > > > >> >> > > the
> > > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > I agree that adding a boolean flag is indeed
> > > useful
> > > > > in
> > > > > > > this
> > > > > > > > > > case.
> > > > > > > > > > >> >> Will
> > > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to throw
> > > fatal
> > > > > > > > exception
> > > > > > > > > > in
> > > > > > > > > > >> >> raw"
> > > > > > > > > > >> >> > not
> > > > > > > > > > >> >> > > > sure what do you mean by "throw" here. Are
> you
> > > > > > proposing
> > > > > > > > the
> > > > > > > > > > >> >> callback
> > > > > > > > > > >> >> > > would
> > > > > > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions
> when
> > > > > > triggered
> > > > > > > > > > without
> > > > > > > > > > >> >> > > wrapping?
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to
> > make
> > > > the
> > > > > > end
> > > > > > > > > > user's
> > > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > > 2. I'm not sure I understand the claim
> > regarding
> > > > the
> > > > > > > > callback
> > > > > > > > > > >> that
> > > > > > > > > > >> >> "In
> > > > > > > > > > >> >> > > EOS
> > > > > > > > > > >> >> > > > setup, it is not required to handle the
> > > exception".
> > > > > Are
> > > > > > > you
> > > > > > > > > > >> >> proposing
> > > > > > > > > > >> >> > > that,
> > > > > > > > > > >> >> > > > e.g. in Streams, we do not try to handle any
> > > > > exceptions
> > > > > > > if
> > > > > > > > EOS
> > > > > > > > > > is
> > > > > > > > > > >> >> > enabled
> > > > > > > > > > >> >> > > > in the callback anymore, but just let
> > commitTxn()
> > > > > > itself
> > > > > > > to
> > > > > > > > > > fail
> > > > > > > > > > >> to
> > > > > > > > > > >> >> be
> > > > > > > > > > >> >> > > > notified about the problem? Personally I
> think
> > in
> > > > > > Streams
> > > > > > > > we
> > > > > > > > > > >> should
> > > > > > > > > > >> >> > just
> > > > > > > > > > >> >> > > > make the handling logic of the callback to be
> > > > > > consistent
> > > > > > > > > > >> regardless
> > > > > > > > > > >> >> of
> > > > > > > > > > >> >> > > the
> > > > > > > > > > >> >> > > > EOS settings (today we have different logic
> > > > depending
> > > > > > on
> > > > > > > > this
> > > > > > > > > > >> logic,
> > > > > > > > > > >> >> > > which
> > > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > My idea originates from the claim on send
> API:
> > > > > > > > > > >> >> > > "When used as part of a transaction, it is not
> > > > > necessary
> > > > > > to
> > > > > > > > > > define
> > > > > > > > > > >> a
> > > > > > > > > > >> >> > > callback or check the result of the future  in
> > > order
> > > > to
> > > > > > > > detect
> > > > > > > > > > >> errors
> > > > > > > > > > >> >> > from
> > > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > > >> >> > > My understanding is that for EOS, the exception
> > > will
> > > > be
> > > > > > > > detected
> > > > > > > > > > by
> > > > > > > > > > >> >> > calling
> > > > > > > > > > >> >> > > transactional APIs either way, so it is a
> > duplicate
> > > > > > > handling
> > > > > > > > to
> > > > > > > > > > >> track
> > > > > > > > > > >> >> > > all the sendExceptions in RecordCollector.
> > > However, I
> > > > > > > looked
> > > > > > > > up
> > > > > > > > > > >> >> > > sendException is being used today as follow:
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any
> > > > default
> > > > > > or
> > > > > > > > > > >> customized
> > > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > > >> >> > > 2. Stop collecting offset info or new
> exceptions
> > > > > > > > > > >> >> > > 3. Check and rethrow exceptions in close(),
> > flush()
> > > > or
> > > > > > new
> > > > > > > > send()
> > > > > > > > > > >> >> calls
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > To my understanding, we should still honor the
> > > > > commitment
> > > > > > > to
> > > > > > > > #1
> > > > > > > > > > for
> > > > > > > > > > >> >> any
> > > > > > > > > > >> >> > > user customized implementation. The #2 does not
> > > > really
> > > > > > > > affect our
> > > > > > > > > > >> >> > decision
> > > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable as it
> > could
> > > > > help
> > > > > > us
> > > > > > > > fail
> > > > > > > > > > >> fast
> > > > > > > > > > >> >> in
> > > > > > > > > > >> >> > > new send() instead of waiting to later stage of
> > > > > > processing.
> > > > > > > > In
> > > > > > > > > > that
> > > > > > > > > > >> >> > sense,
> > > > > > > > > > >> >> > > I agree we should continue to record send
> > > exceptions
> > > > > even
> > > > > > > > under
> > > > > > > > > > EOS
> > > > > > > > > > >> >> case
> > > > > > > > > > >> >> > to
> > > > > > > > > > >> >> > > ensure the strength of stream side Producer
> > logic.
> > > On
> > > > > the
> > > > > > > > safer
> > > > > > > > > > >> side,
> > > > > > > > > > >> >> we
> > > > > > > > > > >> >> > no
> > > > > > > > > > >> >> > > longer need to wrap certain fatal exceptions
> like
> > > > > > > > ProducerFenced
> > > > > > > > > > as
> > > > > > > > > > >> >> > > TaskMigrated, since they should not crash the
> > > stream
> > > > > > thread
> > > > > > > > if
> > > > > > > > > > >> thrown
> > > > > > > > > > >> >> in
> > > > > > > > > > >> >> > > raw format, once we adopt the new processing
> > model
> > > in
> > > > > the
> > > > > > > > send
> > > > > > > > > > >> phase.
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > > Thanks for everyone's feedback so far. I
> have
> > > > > > polished
> > > > > > > > the
> > > > > > > > > > KIP
> > > > > > > > > > >> >> after
> > > > > > > > > > >> >> > > > > offline discussion with some folks working
> on
> > > EOS
> > > > > to
> > > > > > > > make the
> > > > > > > > > > >> >> > exception
> > > > > > > > > > >> >> > > > > handling more lightweight. The essential
> > change
> > > > is
> > > > > > that
> > > > > > > > we
> > > > > > > > > > are
> > > > > > > > > > >> not
> > > > > > > > > > >> >> > > > > inventing a new intermediate exception
> type,
> > > but
> > > > > > > instead
> > > > > > > > > > >> >> separating a
> > > > > > > > > > >> >> > > > full
> > > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > > > This way for any exception thrown from
> phase
> > 1,
> > > > > will
> > > > > > be
> > > > > > > > an
> > > > > > > > > > >> >> indicator
> > > > > > > > > > >> >> > in
> > > > > > > > > > >> >> > > > > phase 2 whether we should do commit or
> abort,
> > > and
> > > > > > from
> > > > > > > > now on
> > > > > > > > > > >> >> > > > > `commitTransaction` should only throw fatal
> > > > > > exceptions,
> > > > > > > > > > >> similar to
> > > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> > KafkaException
> > > > > caught
> > > > > > > in
> > > > > > > > the
> > > > > > > > > > >> >> commit
> > > > > > > > > > >> >> > > phase
> > > > > > > > > > >> >> > > > > will be definitely fatal to crash the app.
> > For
> > > > more
> > > > > > > > advanced
> > > > > > > > > > >> users
> > > > > > > > > > >> >> > such
> > > > > > > > > > >> >> > > > as
> > > > > > > > > > >> >> > > > > Streams, we have the ability to further
> wrap
> > > > > selected
> > > > > > > > types
> > > > > > > > > > of
> > > > > > > > > > >> >> fatal
> > > > > > > > > > >> >> > > > > exceptions to trigger task migration if
> > > > necessary.
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > > > More details in the KIP, feel free to take
> > > > another
> > > > > > > look,
> > > > > > > > > > >> thanks!
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang
> Chen <
> > > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > > >> >> > > > > >
> > > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno
> > Cadonna
> > > <
> > > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > > >> >> > > > wrote:
> > > > > > > > > > >> >> > > > > >
> > > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> Like Matthias, I do also not know the
> > > producer
> > > > > > > > internal
> > > > > > > > > > well
> > > > > > > > > > >> >> > enough
> > > > > > > > > > >> >> > > to
> > > > > > > > > > >> >> > > > > >> comment on the categorization. However,
> I
> > > > think
> > > > > > > > having a
> > > > > > > > > > >> super
> > > > > > > > > > >> >> > > > exception
> > > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that  encodes
> if
> > > an
> > > > > > > > exception is
> > > > > > > > > > >> >> fatal
> > > > > > > > > > >> >> > or
> > > > > > > > > > >> >> > > > not
> > > > > > > > > > >> >> > > > > >> is cleaner, better understandable and
> less
> > > > > > > > error-prone,
> > > > > > > > > > >> because
> > > > > > > > > > >> >> > > > ideally
> > > > > > > > > > >> >> > > > > >> when you add a new non-fatal exception
> in
> > > > future
> > > > > > you
> > > > > > > > just
> > > > > > > > > > >> need
> > > > > > > > > > >> >> to
> > > > > > > > > > >> >> > > > think
> > > > > > > > > > >> >> > > > > >> about letting it inherit from the super
> > > > > exception
> > > > > > > and
> > > > > > > > all
> > > > > > > > > > >> the
> > > > > > > > > > >> >> rest
> > > > > > > > > > >> >> > > of
> > > > > > > > > > >> >> > > > > >> the code will just behave correctly
> > without
> > > > the
> > > > > > need
> > > > > > > > to
> > > > > > > > > > wrap
> > > > > > > > > > >> >> the
> > > > > > > > > > >> >> > new
> > > > > > > > > > >> >> > > > > >> exception into another exception each
> time
> > > it
> > > > is
> > > > > > > > thrown
> > > > > > > > > > >> (maybe
> > > > > > > > > > >> >> it
> > > > > > > > > > >> >> > is
> > > > > > > > > > >> >> > > > > >> thrown at different location in the
> code).
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> As far as I understand the following
> > > statement
> > > > > > from
> > > > > > > > your
> > > > > > > > > > >> >> previous
> > > > > > > > > > >> >> > > > e-mail
> > > > > > > > > > >> >> > > > > >> is the reason that currently such a
> super
> > > > > > exception
> > > > > > > > is not
> > > > > > > > > > >> >> > possible:
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> "Right now we have RetriableException
> > type,
> > > if
> > > > > we
> > > > > > > are
> > > > > > > > > > going
> > > > > > > > > > >> to
> > > > > > > > > > >> >> > add a
> > > > > > > > > > >> >> > > > > >> `ProducerRetriableException` type, we
> have
> > > to
> > > > > put
> > > > > > > > this new
> > > > > > > > > > >> >> > interface
> > > > > > > > > > >> >> > > > as
> > > > > > > > > > >> >> > > > > >> the parent of the RetriableException,
> > > because
> > > > > not
> > > > > > > all
> > > > > > > > > > thrown
> > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > >> >> > > > > >> exceptions are `retriable` in general
> for
> > > > > > producer"
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> In the list of exceptions in your KIP, I
> > > found
> > > > > > > > non-fatal
> > > > > > > > > > >> >> > exceptions
> > > > > > > > > > >> >> > > > that
> > > > > > > > > > >> >> > > > > >> do not inherit from RetriableException.
> I
> > > > guess
> > > > > > > those
> > > > > > > > are
> > > > > > > > > > >> the
> > > > > > > > > > >> >> ones
> > > > > > > > > > >> >> > > you
> > > > > > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> All of those exceptions are non-fatal
> and
> > do
> > > > not
> > > > > > > > inherit
> > > > > > > > > > >> from
> > > > > > > > > > >> >> > > > > >> RetriableException. Is there a reason
> for
> > > > that?
> > > > > If
> > > > > > > > they
> > > > > > > > > > >> >> depended
> > > > > > > > > > >> >> > > from
> > > > > > > > > > >> >> > > > > >> RetriableException we would be a bit
> > closer
> > > > to a
> > > > > > > super
> > > > > > > > > > >> >> exception I
> > > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> The reason is that sender may catch
> those
> > > > > > exceptions
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > >> >> > > > > > ProduceResponse, and it currently does
> > > infinite
> > > > > > > > > > >> >> > > > > > retries on RetriableException. To make
> sure
> > > we
> > > > > > could
> > > > > > > > > > trigger
> > > > > > > > > > >> the
> > > > > > > > > > >> >> > > > > > abortTransaction() by catching non-fatal
> > > thrown
> > > > > > > > > > >> >> > > > > > exceptions and reinitialize the txn
> state,
> > we
> > > > > chose
> > > > > > > > not to
> > > > > > > > > > >> let
> > > > > > > > > > >> >> > those
> > > > > > > > > > >> >> > > > > > exceptions inherit RetriableException so
> > that
> > > > > > > > > > >> >> > > > > > they won't cause infinite retry on
> sender.
> > > > > > > > > > >> >> > > > > >
> > > > > > > > > > >> >> > > > > >
> > > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > > >> >> > > > think
> > > > > > > > > > >> >> > > > > >> to understand that their fatality
> depends
> > on
> > > > the
> > > > > > > type
> > > > > > > > > > (i.e.
> > > > > > > > > > >> >> > > > > >> configuration) of the producer. That
> makes
> > > it
> > > > > > > > difficult to
> > > > > > > > > > >> >> have a
> > > > > > > > > > >> >> > > > super
> > > > > > > > > > >> >> > > > > >> exception that encodes the retriability
> as
> > > > > > mentioned
> > > > > > > > > > above.
> > > > > > > > > > >> >> Would
> > > > > > > > > > >> >> > it
> > > > > > > > > > >> >> > > > be
> > > > > > > > > > >> >> > > > > >> possible to introduce exceptions that
> > > inherit
> > > > > from
> > > > > > > > > > >> >> > > RetriableException
> > > > > > > > > > >> >> > > > > >> and that are thrown when those
> exceptions
> > > are
> > > > > > caught
> > > > > > > > from
> > > > > > > > > > >> the
> > > > > > > > > > >> >> > > brokers
> > > > > > > > > > >> >> > > > > >> and the type of the producer is such
> that
> > > the
> > > > > > > > exceptions
> > > > > > > > > > are
> > > > > > > > > > >> >> > > > retriable?
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> Yea, I think in general the exception
> type
> > > > > mixing
> > > > > > is
> > > > > > > > > > >> difficult
> > > > > > > > > > >> >> to
> > > > > > > > > > >> >> > > get
> > > > > > > > > > >> >> > > > > > them all right. I have already proposed
> > > another
> > > > > > > > solution
> > > > > > > > > > >> based
> > > > > > > > > > >> >> on
> > > > > > > > > > >> >> > my
> > > > > > > > > > >> >> > > > > > offline discussion with some folks
> working
> > on
> > > > EOS
> > > > > > > > > > >> >> > > > > > to make the handling more straightforward
> > for
> > > > end
> > > > > > > users
> > > > > > > > > > >> without
> > > > > > > > > > >> >> the
> > > > > > > > > > >> >> > > > need
> > > > > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > > > > >> >> > > > > >
> > > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I
> made a
> > > > pass
> > > > > > over
> > > > > > > > the
> > > > > > > > > > >> list
> > > > > > > > > > >> >> and
> > > > > > > > > > >> >> > > here
> > > > > > > > > > >> >> > > > > are
> > > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > 0) Although this is not part of the
> > public
> > > > > API,
> > > > > > I
> > > > > > > > think
> > > > > > > > > > we
> > > > > > > > > > >> >> > should
> > > > > > > > > > >> >> > > > make
> > > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e. user
> > can
> > > > > always
> > > > > > > > call
> > > > > > > > > > >> >> abortTxn()
> > > > > > > > > > >> >> > > > when
> > > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are indeed
> > > > > supported.
> > > > > > > > E.g. if
> > > > > > > > > > >> the
> > > > > > > > > > >> >> txn
> > > > > > > > > > >> >> > > is
> > > > > > > > > > >> >> > > > > >> already
> > > > > > > > > > >> >> > > > > >> > aborted by the broker side, then users
> > can
> > > > > still
> > > > > > > > call
> > > > > > > > > > >> >> abortTxn
> > > > > > > > > > >> >> > > which
> > > > > > > > > > >> >> > > > > >> would
> > > > > > > > > > >> >> > > > > >> > not throw another exception but just
> be
> > > > > treated
> > > > > > > as a
> > > > > > > > > > >> no-op.
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*:
> I
> > > > think
> > > > > > this
> > > > > > > > error
> > > > > > > > > > >> can
> > > > > > > > > > >> >> > also
> > > > > > > > > > >> >> > > be
> > > > > > > > > > >> >> > > > > >> > returned but not documented yet. This
> > > should
> > > > > be
> > > > > > a
> > > > > > > > > > >> non-fatal
> > > > > > > > > > >> >> > error.
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this
> > error
> > > is
> > > > > > > > returned
> > > > > > > > > > from
> > > > > > > > > > >> >> > broker
> > > > > > > > > > >> >> > > > when
> > > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > > >> >> > > > > >> > state transition failed (e.g. it is
> > trying
> > > > to
> > > > > > > > transit to
> > > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > > prepare-commit).
> > > > > > > This
> > > > > > > > > > error
> > > > > > > > > > >> >> could
> > > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > > >> >> > > > > >> > a bug on the client internal code or
> the
> > > > > broker
> > > > > > > > code,
> > > > > > > > > > OR a
> > > > > > > > > > >> >> user
> > > > > > > > > > >> >> > > > error
> > > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > > >> >> > > > > >> > similar error is
> > > > > > ConcurrentTransactionsException,
> > > > > > > > i.e.
> > > > > > > > > > if
> > > > > > > > > > >> >> Kafka
> > > > > > > > > > >> >> > is
> > > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > > >> >> > > > > >> > these exceptions would only be
> returned
> > if
> > > > > users
> > > > > > > > try to
> > > > > > > > > > do
> > > > > > > > > > >> >> > > something
> > > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a
> > > > commitTxn,
> > > > > > > etc.
> > > > > > > > So
> > > > > > > > > > I'm
> > > > > > > > > > >> >> > > thinking
> > > > > > > > > > >> >> > > > it
> > > > > > > > > > >> >> > > > > >> > should be a non-fatal error instead
> of a
> > > > fatal
> > > > > > > > error,
> > > > > > > > > > >> wdyt?
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates
> > > fatal
> > > > > > > > > > transactional
> > > > > > > > > > >> >> > > sequence
> > > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting
> with
> > > the
> > > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > > >> >> > > > > >> that
> > > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess your
> > > > proposal
> > > > > > is
> > > > > > > > that
> > > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be
> > > treated
> > > > > > > either
> > > > > > > > as
> > > > > > > > > > >> fatal
> > > > > > > > > > >> >> > with
> > > > > > > > > > >> >> > > > > >> > transactional producer, or non-fatal
> > with
> > > > > > > idempotent
> > > > > > > > > > >> >> producer,
> > > > > > > > > > >> >> > is
> > > > > > > > > > >> >> > > > that
> > > > > > > > > > >> >> > > > > >> > right? If the producer is only
> > configured
> > > > with
> > > > > > > > > > idempotency
> > > > > > > > > > >> >> but
> > > > > > > > > > >> >> > not
> > > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > > >> >> > > for
> > > > > > > > > > >> >> > > > > >> > non-fatal errors would be confusing
> > since
> > > > > users
> > > > > > > are
> > > > > > > > not
> > > > > > > > > > >> using
> > > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > > > > > > > >> OutOfSequenceException
> > > > > > > > > > >> >> > as-is
> > > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the
> producer
> > is
> > > > > > > > configured,
> > > > > > > > > > and
> > > > > > > > > > >> >> let
> > > > > > > > > > >> >> > the
> > > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > > >> >> > > > > >> > decide how to handle it based on
> whether
> > > it
> > > > is
> > > > > > > only
> > > > > > > > > > >> >> idempotent
> > > > > > > > > > >> >> > or
> > > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the
> > `send()`
> > > > > > > callback /
> > > > > > > > > > >> future
> > > > > > > > > > >> >> can
> > > > > > > > > > >> >> > > also
> > > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > > >> >> > > > > >> > txn-related exceptions, I think this
> KIP
> > > > > should
> > > > > > > also
> > > > > > > > > > cover
> > > > > > > > > > >> >> this
> > > > > > > > > > >> >> > > API
> > > > > > > > > > >> >> > > > as
> > > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above:
> > > sometimes
> > > > > > those
> > > > > > > > > > >> non-fatal
> > > > > > > > > > >> >> > errors
> > > > > > > > > > >> >> > > > > like
> > > > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are
> not
> > > due
> > > > > to
> > > > > > > the
> > > > > > > > > > state
> > > > > > > > > > >> >> being
> > > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > > >> >> > > > > >> > at the broker side, but maybe users
> are
> > > > doing
> > > > > > > > something
> > > > > > > > > > >> >> wrong.
> > > > > > > > > > >> >> > So
> > > > > > > > > > >> >> > > > I'm
> > > > > > > > > > >> >> > > > > >> > wondering if we should further
> > distinguish
> > > > > those
> > > > > > > > > > non-fatal
> > > > > > > > > > >> >> > errors
> > > > > > > > > > >> >> > > > > >> between
> > > > > > > > > > >> >> > > > > >> > a) those that are caused by Kafka
> > itself,
> > > > > e.g. a
> > > > > > > > broker
> > > > > > > > > > >> timed
> > > > > > > > > > >> >> > out
> > > > > > > > > > >> >> > > > and
> > > > > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn
> > request
> > > is
> > > > > > > > received,
> > > > > > > > > > >> and b)
> > > > > > > > > > >> >> > the
> > > > > > > > > > >> >> > > > > user's
> > > > > > > > > > >> >> > > > > >> > API call pattern is incorrect, causing
> > the
> > > > > > request
> > > > > > > > to be
> > > > > > > > > > >> >> > rejected
> > > > > > > > > > >> >> > > > with
> > > > > > > > > > >> >> > > > > >> an
> > > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > > >> >> > > > feels
> > > > > > > > > > >> >> > > > > >> to
> > > > > > > > > > >> >> > > > > >> > me more like for case a), but not case
> > b).
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang
> > > Chen <
> > > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your
> proposal
> > > > makes
> > > > > > > sense
> > > > > > > > as
> > > > > > > > > > >> well,
> > > > > > > > > > >> >> on
> > > > > > > > > > >> >> > > the
> > > > > > > > > > >> >> > > > > pro
> > > > > > > > > > >> >> > > > > >> side
> > > > > > > > > > >> >> > > > > >> >> we could have a universally agreed
> > > > exception
> > > > > > type
> > > > > > > > to be
> > > > > > > > > > >> >> caught
> > > > > > > > > > >> >> > by
> > > > > > > > > > >> >> > > > the
> > > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > > >> >> > > > > >> >> without having an extra layer on top
> of
> > > the
> > > > > > > actual
> > > > > > > > > > >> >> exceptions.
> > > > > > > > > > >> >> > I
> > > > > > > > > > >> >> > > > > could
> > > > > > > > > > >> >> > > > > >> see
> > > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be
> more
> > > > > > complex.
> > > > > > > > Right
> > > > > > > > > > >> now
> > > > > > > > > > >> >> we
> > > > > > > > > > >> >> > > have
> > > > > > > > > > >> >> > > > > >> >> RetriableException type, if we are
> > going
> > > to
> > > > > > add a
> > > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we
> > > have
> > > > to
> > > > > > put
> > > > > > > > this
> > > > > > > > > > >> new
> > > > > > > > > > >> >> > > > interface
> > > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > > >> >> > > > > >> >> parent of the RetriableException,
> > because
> > > > not
> > > > > > all
> > > > > > > > > > thrown
> > > > > > > > > > >> >> > > non-fatal
> > > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in general
> > for
> > > > > > > > producer, for
> > > > > > > > > > >> >> example
> > > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > > >> >> > to
> > > > > > > > > > >> >> > > > let
> > > > > > > > > > >> >> > > > > >> all
> > > > > > > > > > >> >> > > > > >> >> the thrown exceptions implement for
> > sure,
> > > > but
> > > > > > > it's
> > > > > > > > a
> > > > > > > > > > bit
> > > > > > > > > > >> >> > > unorthodox
> > > > > > > > > > >> >> > > > > in
> > > > > > > > > > >> >> > > > > >> the
> > > > > > > > > > >> >> > > > > >> >> way we deal with exceptions in
> general.
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > > > > > > KafkaException
> > > > > > > > > > >> wrapping
> > > > > > > > > > >> >> > > another
> > > > > > > > > > >> >> > > > > >> >> KafkaException as either fatal or
> > > > non-fatal.
> > > > > If
> > > > > > > we
> > > > > > > > use
> > > > > > > > > > an
> > > > > > > > > > >> >> > > interface
> > > > > > > > > > >> >> > > > > to
> > > > > > > > > > >> >> > > > > >> >> solve #1, it is also required to
> > > implement
> > > > > > > another
> > > > > > > > > > >> bloated
> > > > > > > > > > >> >> > > > exception
> > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > >> >> > > > > >> >> which could replace KafkaException
> > type,
> > > as
> > > > > we
> > > > > > > > couldn't
> > > > > > > > > > >> mark
> > > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation,
> > > wrapping
> > > > > > means
> > > > > > > we
> > > > > > > > > > could
> > > > > > > > > > >> >> limit
> > > > > > > > > > >> >> > > the
> > > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > > >> >> > > > > >> >> of affection to the producer only,
> > which
> > > is
> > > > > > > > important
> > > > > > > > > > >> since
> > > > > > > > > > >> >> we
> > > > > > > > > > >> >> > > > don't
> > > > > > > > > > >> >> > > > > >> want
> > > > > > > > > > >> >> > > > > >> >> shared exception types to implement a
> > > > > > > > producer-related
> > > > > > > > > > >> >> > interface,
> > > > > > > > > > >> >> > > > > such
> > > > > > > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM
> Matthias
> > > J.
> > > > > Sax
> > > > > > <
> > > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions
> > makes a
> > > > lot
> > > > > > of
> > > > > > > > sense.
> > > > > > > > > > >> As I
> > > > > > > > > > >> >> > > don't
> > > > > > > > > > >> >> > > > > know
> > > > > > > > > > >> >> > > > > >> >>> the producer internals well enough,
> I
> > > > cannot
> > > > > > > > comment
> > > > > > > > > > on
> > > > > > > > > > >> the
> > > > > > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we should
> > > > > introduce
> > > > > > > an
> > > > > > > > > > >> exception
> > > > > > > > > > >> >> > > > > interface
> > > > > > > > > > >> >> > > > > >> >>> that non-fatal exception would
> > implement
> > > > > > instead
> > > > > > > > of
> > > > > > > > > > >> >> creating a
> > > > > > > > > > >> >> > > new
> > > > > > > > > > >> >> > > > > >> class
> > > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions?
> > > What
> > > > > > would
> > > > > > > > be the
> > > > > > > > > > >> >> > pros/cons
> > > > > > > > > > >> >> > > > for
> > > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen
> > wrote:
> > > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > >> >> > > > > >> >>>> I would like to start a discussion
> > > thread
> > > > > for
> > > > > > > > > > KIP-691:
> > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >>
> > > > > > > > > > >>
> > > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the
> > > > exception
> > > > > > > > handling
> > > > > > > > > > >> logic
> > > > > > > > > > >> >> > for
> > > > > > > > > > >> >> > > > > >> >>>> transactional Producer users by
> > > > classifying
> > > > > > > > fatal and
> > > > > > > > > > >> >> > non-fatal
> > > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > > >> >> > > > > >> >>>> and throw them correspondingly for
> > > easier
> > > > > > catch
> > > > > > > > and
> > > > > > > > > > >> retry.
> > > > > > > > > > >> >> > Let
> > > > > > > > > > >> >> > > me
> > > > > > > > > > >> >> > > > > >> know
> > > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >> >
> > > > > > > > > > >> >> > > > > >>
> > > > > > > > > > >> >> > > > > >
> > > > > > > > > > >> >> > > > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > > > --
> > > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > > >> >> > > >
> > > > > > > > > > >> >> > >
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >> > --
> > > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > > >> >> >
> > > > > > > > > > >> >>
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > --
> > > > > > > > > > >> > -- Guozhang
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> --
> > > > > > > > > > >> -- Guozhang
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Hey Boyang,

I think maybe there's something cracking here :) I'm just asking for
clarifications that as of today, which non-fatal exceptions the newly
introduced CommitFailedException would cover, and it seems to be only 1)
unknown pid, 2) invalid pid mapping, and 3) concurrent transactions. Is
that correct?

On Thu, Jan 28, 2021 at 5:06 PM Boyang Chen <re...@gmail.com>
wrote:

> Hey Guozhang, I think TimeoutException would not be covered here as it
> potentially has a risk of hitting an illegal state on the broker side when
> the previous commit was actually successful. Users should try to increase
> their max.block.ms to avoid hitting the timeout as a base suggestion,
> which
> is discussed in the KIP.
>
> As for the umbrella exception, I agree it has values to some extent, but in
> terms of adoption difficulty, additional exception types usually make the
> error handling more complex than simplified, and we are doing our best to
> avoid compatibility issues caused by wrapping previously thrown raw
> exceptions that invalidates user error handling. And I'm also not in favor
> of our exception hierarchy today where all exceptions are subclasses of
> KafkaException, and we already mixed its handling with other specific
> exception types that could be caught as superclass KafkaException. That
> could be addressed in a separate proposal if we see necessary.
>
> Boyang
>
> On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Thanks Jason for the suggestion, that looks good to me too.
> >
> > Regarding the non-fatal exceptions wrapped as CommitFailed, I would like
> to
> > clarify if we would cover all the following cases: 1) timeout, 2) unknown
> > pid, 3) invalid pid mapping, 4) concurrent transactions?
> >
> > BTW I think it still makes sense to use an umbrella exception in case in
> > the future we add more non-fatal cases even though today we only have a
> > few.
> >
> >
> > Guozhang
> >
> > On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Thanks Jason, I agree with the proposed solution here, will update the
> > KIP.
> > >
> > > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Boyang,
> > > >
> > > > It seems like a reasonable suggestion. I wonder if a flag is
> sufficient
> > > > though. The current `Callback` documentation treats "fatal" errors
> from
> > > the
> > > > perspective of the individual message that was sent.
> > > >
> > > > ```
> > > >     *                  Non-Retriable exceptions (fatal, the message
> > will
> > > > never be sent):
> > > > ```
> > > >
> > > > However, we also have fatal errors from the perspective of the
> > > transaction
> > > > (e.g. when the producer gets fenced). Perhaps that suggests we need
> > > > something richer than a boolean flag:
> > > >
> > > > At a high level, I think the following cases are possible:
> > > >
> > > > - message rejected (e.g. message too large, invalid topic)
> > > > - delivery failed after retries/delivery timeout (e.g. timeout, crc
> > > error,
> > > > not enough replicas)
> > > > - transaction failed (e.g. producer fenced, invalid transaction
> state)
> > > >
> > > > Perhaps instead we can have a type like the following:
> > > >
> > > > class SendFailure {
> > > >   FailureType failureType;
> > > >   Exception cause;
> > > > }
> > > >
> > > > enum FailureType {
> > > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > > }
> > > >
> > > > (Not married to any of these names, just a starting point.)
> > > >
> > > > Then we add a new `onCompletion` as you've suggested:
> > > >
> > > > default void onCompletion(RecordMetadata metadata, SendFailure
> > failure) {
> > > >   this.onCompletion(metadata, failure.cause());
> > > > }
> > > >
> > > > This would give streams and other applications enough information to
> > know
> > > > whether the message can be retried and whether the transaction can be
> > > > aborted.
> > > >
> > > > What do you think?
> > > >
> > > > -Jason
> > > >
> > > >
> > > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Jason for the thoughts.
> > > > >
> > > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <
> jason@confluent.io
> > >
> > > > > wrote:
> > > > >
> > > > > > Hi Boyang,
> > > > > >
> > > > > > Thanks for the iterations here. I think this is something we
> should
> > > > have
> > > > > > done a long time ago. It sounds like there are two API changes
> > here:
> > > > > >
> > > > > > 1. We are introducing the `CommitFailedException` to wrap
> abortable
> > > > > > errors that are raised from `commitTransaction`. This sounds fine
> > to
> > > > me.
> > > > > As
> > > > > > far as I know, the only case we might need this is when we add
> > > support
> > > > to
> > > > > > let producers recover from coordinator timeouts. Are there any
> > > others?
> > > > > >
> > > > > > I think the purpose here is to ensure non-fatal exceptions are
> > > unified
> > > > > under the same
> > > > > exception umbrella, to make the proceeding to abort any ongoing
> > > > transaction
> > > > > much easier.
> > > > > I don't think `coordinator timeouts` is the only case to recover
> > here,
> > > > > since we have other
> > > > > non-fatal exceptions such as UnknownPid.
> > > > >
> > > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > > `KafkaException`.
> > > > > > The motivation for this is less clear to me and it doesn't look
> > like
> > > > the
> > > > > > example from the KIP depends on it. My concern here is
> > compatibility.
> > > > > > Currently we have the following documentation for the `Callback`
> > API:
> > > > > >
> > > > > > ```
> > > > > >      *                  Non-Retriable exceptions (fatal, the
> > message
> > > > will
> > > > > > never be sent):
> > > > > >      *
> > > > > >      *                  InvalidTopicException
> > > > > >      *                  OffsetMetadataTooLargeException
> > > > > >      *                  RecordBatchTooLargeException
> > > > > >      *                  RecordTooLargeException
> > > > > >      *                  UnknownServerException
> > > > > >      *                  UnknownProducerIdException
> > > > > >      *                  InvalidProducerEpochException
> > > > > >      *
> > > > > >      *                  Retriable exceptions (transient, may be
> > > covered
> > > > > by
> > > > > > increasing #.retries):
> > > > > >      *
> > > > > >      *                  CorruptRecordException
> > > > > >      *                  InvalidMetadataException
> > > > > >      *                  NotEnoughReplicasAfterAppendException
> > > > > >      *                  NotEnoughReplicasException
> > > > > >      *                  OffsetOutOfRangeException
> > > > > >      *                  TimeoutException
> > > > > >      *                  UnknownTopicOrPartitionException
> > > > > > ```
> > > > > >
> > > > > > If we wrap all the retriable exceptions documented here as
> > > > > > `KafkaException`, wouldn't that break any error handling that
> users
> > > > might
> > > > > > already have? it's gonna introduce a compatibility issue.
> > > > > >
> > > > > > The original intention was to simplify `send` callback error
> > handling
> > > > by
> > > > > doing exception wrapping, as on Streams level
> > > > > we have to prepare an exhausting list of exceptions to catch as
> > fatal,
> > > > and
> > > > > the same lengthy list to catch as
> > > > > non-fatal. It would be much easier if we got `hints` from the
> > callback.
> > > > > However,
> > > > > I agree there is a compatibility concern, what about deprecating
> the
> > > > > existing:
> > > > >
> > > > > void onCompletion(RecordMetadata metadata, Exception exception)
> > > > >
> > > > > and replace it with:
> > > > >
> > > > > default void onCompletion(RecordMetadata metadata, Exception
> > exception,
> > > > > boolean isFatal) {
> > > > >   this.onCompletion(metadata, exception);
> > > > > }
> > > > >
> > > > > to make sure new users get the benefit of understanding the
> fatality
> > > > based
> > > > > on the info presented by the producer?
> > > > >
> > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > >
> > > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com>
> > > wrote:
> > > > > >
> > > > > > > Why  we are receiving all emails kindly remove us from
> > > > > > > dev@kafka.apache.org we don't want to receive emails anymore.
> > > > > > >
> > > > > > > Thanks
> > > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks Boyang, yes I think I was confused about the different
> > > > > handling
> > > > > > of
> > > > > > > > two abortTxn calls, and now I get it was not intentional. I
> > > think I
> > > > > do
> > > > > > > not
> > > > > > > > have more concerns.
> > > > > > > >
> > > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > > reluctanthero104@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for the clarification Guozhang, I got your point
> that
> > we
> > > > > want
> > > > > > to
> > > > > > > > > have a consistent handling of fatal exceptions being thrown
> > > from
> > > > > the
> > > > > > > > > abortTxn. I modified the current template to move the fatal
> > > > > exception
> > > > > > > > > try-catch outside of the processing loop to make sure we
> > could
> > > > get
> > > > > a
> > > > > > > chance
> > > > > > > > > to close consumer/producer modules. Let me know what you
> > think.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Boyang
> > > > > > > > >
> > > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > > reluctanthero104@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > My understanding is that abortTransaction would only
> throw
> > > when
> > > > > the
> > > > > > > > > > producer is in fatal state. For CommitFailed, the
> producer
> > > > should
> > > > > > > still
> > > > > > > > > be
> > > > > > > > > > in the abortable error state, so that abortTransaction
> call
> > > > would
> > > > > > not
> > > > > > > > > throw.
> > > > > > > > > >
> > > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > > > wangguoz@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Or are you going to maintain some internal state such
> > that,
> > > > the
> > > > > > > > > >> `abortTransaction` in the catch block would never throw
> > > again?
> > > > > > > > > >>
> > > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > > > wangguoz@gmail.com>
> > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > > >> >
> > > > > > > > > >> > I've also thought about this (i.e. using CommitFailed
> > for
> > > > all
> > > > > > > > > >> non-fatal),
> > > > > > > > > >> > but what I'm pondering is that, in the catch
> > > (CommitFailed)
> > > > > > block,
> > > > > > > > > what
> > > > > > > > > >> > would happen if the `producer.abortTransaction();`
> > throws
> > > > > again?
> > > > > > > > > should
> > > > > > > > > >> > that be captured as a fatal and cause the client to
> > close
> > > > > again.
> > > > > > > > > >> >
> > > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > > >> >
> > > > > > > > > >> > ...
> > > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > > >> >         // Transaction commit failed with abortable
> > error,
> > > > > user
> > > > > > > could
> > > > > > > > > >> reset
> > > > > > > > > >> >         // the application state and resume with a new
> > > > > > > transaction.
> > > > > > > > > The
> > > > > > > > > >> > root
> > > > > > > > > >> >         // cause was wrapped in the thrown exception.
> > > > > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > > > > >> >         try {
> > > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > > >> >             producer.close();
> > > > > > > > > >> >             consumer.close();
> > > > > > > > > >> >             throw e;
> > > > > > > > > >> >         }
> > > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > > >> >         producer.close();
> > > > > > > > > >> >         consumer.close();
> > > > > > > > > >> >         throw e;
> > > > > > > > > >> >     }
> > > > > > > > > >> > ...
> > > > > > > > > >> >
> > > > > > > > > >> > Guozhang
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > > >> > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> >> Hey Guozhang,
> > > > > > > > > >> >>
> > > > > > > > > >> >> Jason and I were discussing the new API offline and
> > > decided
> > > > > to
> > > > > > > take
> > > > > > > > > >> >> another
> > > > > > > > > >> >> approach. Firstly, the reason not to invent a new API
> > > with
> > > > > > > returned
> > > > > > > > > >> >> boolean
> > > > > > > > > >> >> flag is for compatibility consideration, since old
> EOS
> > > code
> > > > > > > would not
> > > > > > > > > >> know
> > > > > > > > > >> >> that a given transaction commit was failed internally
> > as
> > > > they
> > > > > > > don't
> > > > > > > > > >> listen
> > > > > > > > > >> >> to the output flag. Our proposed alternative solution
> > is
> > > to
> > > > > let
> > > > > > > > > >> >> *commitTransaction
> > > > > > > > > >> >> throw CommitFailedException whenever the commit
> failed
> > > with
> > > > > > > non-fatal
> > > > > > > > > >> >> exception*, so that on the caller side the handling
> > logic
> > > > > > > becomes:
> > > > > > > > > >> >>
> > > > > > > > > >> >> try {
> > > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > > >> >>         } else {
> > > > > > > > > >> >>             resetToLastCommittedPositions(consumer);
> > > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > > >> >>         }
> > > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > > >> >>         // Transaction commit failed with abortable
> > > error,
> > > > > user
> > > > > > > could
> > > > > > > > > >> >> reset
> > > > > > > > > >> >>         // the application state and resume with a
> new
> > > > > > > transaction.
> > > > > > > > > The
> > > > > > > > > >> >> root
> > > > > > > > > >> >>         // cause was wrapped in the thrown exception.
> > > > > > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > > >> >>         producer.close();
> > > > > > > > > >> >>         consumer.close();
> > > > > > > > > >> >>         throw e;
> > > > > > > > > >> >>     }
> > > > > > > > > >> >>
> > > > > > > > > >> >> This approach looks cleaner as all exception types
> > other
> > > > than
> > > > > > > > > >> CommitFailed
> > > > > > > > > >> >> will doom to be fatal, which is very easy to adopt
> for
> > > > users.
> > > > > > In
> > > > > > > the
> > > > > > > > > >> >> meantime, we still maintain the commitTxn behavior to
> > > throw
> > > > > > > instead
> > > > > > > > > of
> > > > > > > > > >> >> silently failing.
> > > > > > > > > >> >>
> > > > > > > > > >> >> In addition, we decided to drop the recommendation to
> > > > handle
> > > > > > > > > >> >> TimeoutException and leave it to the users to make
> the
> > > > call.
> > > > > > The
> > > > > > > > > >> downside
> > > > > > > > > >> >> for blindly calling abortTxn upon timeout is that we
> > > could
> > > > > > > result in
> > > > > > > > > an
> > > > > > > > > >> >> illegal state when the commit was already successful
> on
> > > the
> > > > > > > broker
> > > > > > > > > >> >> side. Without a good guarantee on the outcome,
> > > > > overcomplicating
> > > > > > > the
> > > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > > >> >>
> > > > > > > > > >> >> Let me know your thoughts on the new approach here,
> > thank
> > > > > you!
> > > > > > > > > >> >>
> > > > > > > > > >> >> Best,
> > > > > > > > > >> >> Boyang
> > > > > > > > > >> >>
> > > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > > > > > wangguoz@gmail.com>
> > > > > > > > > >> >> wrote:
> > > > > > > > > >> >>
> > > > > > > > > >> >> > Thanks for your clarification on 2)/3), that makes
> > > sense.
> > > > > > > > > >> >> >
> > > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > > >> >> > wrote:
> > > > > > > > > >> >> >
> > > > > > > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > > > > > > wangguoz@gmail.com>
> > > > > > > > > >> >> > wrote:
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > Thanks for the updated KIP. I read it again and
> > > have
> > > > > the
> > > > > > > > > >> following
> > > > > > > > > >> >> > > > thoughts:
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does
> not
> > > > throw
> > > > > > any
> > > > > > > > > >> >> non-fatal
> > > > > > > > > >> >> > > > exception, and instead we rely on the
> subsequent
> > > > > beginTxn
> > > > > > > call
> > > > > > > > > to
> > > > > > > > > >> >> > throw,
> > > > > > > > > >> >> > > it
> > > > > > > > > >> >> > > > may violate some callers with a pattern that
> > > relying
> > > > on
> > > > > > > > > >> commitTxn to
> > > > > > > > > >> >> > > > succeed to make some non-rollback operations.
> For
> > > > > > example:
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > beginTxn()
> > > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > > >> >> > > > commitTxn()
> > > > > > > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > -------------
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > The issue is that, committing DB is a
> > non-rollback
> > > > > > > operation,
> > > > > > > > > and
> > > > > > > > > >> >> users
> > > > > > > > > >> >> > > may
> > > > > > > > > >> >> > > > just rely on commitTxn to return without error
> to
> > > > make
> > > > > > this
> > > > > > > > > >> >> > non-rollback
> > > > > > > > > >> >> > > > call. Of course we can just claim this pattern
> is
> > > not
> > > > > > > > > legitimate
> > > > > > > > > >> >> and is
> > > > > > > > > >> >> > > not
> > > > > > > > > >> >> > > > the right way of doing things, but many users
> may
> > > > > > naturally
> > > > > > > > > adopt
> > > > > > > > > >> >> this
> > > > > > > > > >> >> > > > pattern.
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > So maybe we should still let commitTxn also
> throw
> > > > > > non-fatal
> > > > > > > > > >> >> exceptions,
> > > > > > > > > >> >> > > in
> > > > > > > > > >> >> > > > which case we would then call abortTxn again.
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > try {
> > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > >> >> > > >    // do something
> > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > >> >> > > > }
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > > >> >> > > >     try {
> > > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > > >> >> > > >     } catch (...) {        // enumerate all
> fatal
> > > > > > > exceptions
> > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > > >> >> > > >     }
> > > > > > > > > >> >> > > > }
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > > >> >> > > > try {
> > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > > >> >> > > >         shutdown()
> > > > > > > > > >> >> > > >     }
> > > > > > > > > >> >> > > > }
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > ---------------------
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > Thank makes me think, the alternative we have
> > > > discussed
> > > > > > > offline
> > > > > > > > > >> may
> > > > > > > > > >> >> be
> > > > > > > > > >> >> > > > better: let commitTxn() to return a boolean
> flag.
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > * If it returns true, it means the commit
> > > succeeded.
> > > > > > Users
> > > > > > > can
> > > > > > > > > >> >> > > comfortably
> > > > > > > > > >> >> > > > continue and do any external non-rollback
> > > operations
> > > > if
> > > > > > > they
> > > > > > > > > >> like.
> > > > > > > > > >> >> > > > * If it returns false, it means the commit
> failed
> > > > with
> > > > > > > > > non-fatal
> > > > > > > > > >> >> error
> > > > > > > > > >> >> > > *AND
> > > > > > > > > >> >> > > > the txn has been aborted*. Users do not need to
> > > call
> > > > > > > abortTxn
> > > > > > > > > >> again.
> > > > > > > > > >> >> > > > * If it throws, then it means fatal errors.
> Users
> > > > > should
> > > > > > > shut
> > > > > > > > > >> down
> > > > > > > > > >> >> the
> > > > > > > > > >> >> > > > client.
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > try {
> > > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > > >> >> > > >    // do something
> > > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > > >> >> > > > }
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > try {
> > > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > > >> >> > > >     } else {
> > > > > > > > > >> >> > > >         // reset offsets, rollback operations,
> > etc
> > > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > > >> >> > > >     }
> > > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > > >> >> > > >     shutdown()
> > > > > > > > > >> >> > > > }
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > > > >> >> > > > else
> > > > > > > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > -------------------------
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > Of course, if we want to have better visibility
> > > into
> > > > > what
> > > > > > > > > caused
> > > > > > > > > >> the
> > > > > > > > > >> >> > > commit
> > > > > > > > > >> >> > > > to fail and txn to abort. We can let the return
> > > type
> > > > be
> > > > > > an
> > > > > > > enum
> > > > > > > > > >> >> instead
> > > > > > > > > >> >> > > of
> > > > > > > > > >> >> > > > a flag. But my main idea is to still let the
> > > > commitTxn
> > > > > be
> > > > > > > the
> > > > > > > > > >> final
> > > > > > > > > >> >> > point
> > > > > > > > > >> >> > > > users can tell whether this txn succeeded or
> not,
> > > > > instead
> > > > > > > of
> > > > > > > > > >> >> relying on
> > > > > > > > > >> >> > > the
> > > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > I agree that adding a boolean flag is indeed
> > useful
> > > > in
> > > > > > this
> > > > > > > > > case.
> > > > > > > > > >> >> Will
> > > > > > > > > >> >> > > update the KIP.
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to throw
> > fatal
> > > > > > > exception
> > > > > > > > > in
> > > > > > > > > >> >> raw"
> > > > > > > > > >> >> > not
> > > > > > > > > >> >> > > > sure what do you mean by "throw" here. Are you
> > > > > proposing
> > > > > > > the
> > > > > > > > > >> >> callback
> > > > > > > > > >> >> > > would
> > > > > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions when
> > > > > triggered
> > > > > > > > > without
> > > > > > > > > >> >> > > wrapping?
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to
> make
> > > the
> > > > > end
> > > > > > > > > user's
> > > > > > > > > >> >> > > expectation consistent
> > > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > > 2. I'm not sure I understand the claim
> regarding
> > > the
> > > > > > > callback
> > > > > > > > > >> that
> > > > > > > > > >> >> "In
> > > > > > > > > >> >> > > EOS
> > > > > > > > > >> >> > > > setup, it is not required to handle the
> > exception".
> > > > Are
> > > > > > you
> > > > > > > > > >> >> proposing
> > > > > > > > > >> >> > > that,
> > > > > > > > > >> >> > > > e.g. in Streams, we do not try to handle any
> > > > exceptions
> > > > > > if
> > > > > > > EOS
> > > > > > > > > is
> > > > > > > > > >> >> > enabled
> > > > > > > > > >> >> > > > in the callback anymore, but just let
> commitTxn()
> > > > > itself
> > > > > > to
> > > > > > > > > fail
> > > > > > > > > >> to
> > > > > > > > > >> >> be
> > > > > > > > > >> >> > > > notified about the problem? Personally I think
> in
> > > > > Streams
> > > > > > > we
> > > > > > > > > >> should
> > > > > > > > > >> >> > just
> > > > > > > > > >> >> > > > make the handling logic of the callback to be
> > > > > consistent
> > > > > > > > > >> regardless
> > > > > > > > > >> >> of
> > > > > > > > > >> >> > > the
> > > > > > > > > >> >> > > > EOS settings (today we have different logic
> > > depending
> > > > > on
> > > > > > > this
> > > > > > > > > >> logic,
> > > > > > > > > >> >> > > which
> > > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > My idea originates from the claim on send API:
> > > > > > > > > >> >> > > "When used as part of a transaction, it is not
> > > > necessary
> > > > > to
> > > > > > > > > define
> > > > > > > > > >> a
> > > > > > > > > >> >> > > callback or check the result of the future  in
> > order
> > > to
> > > > > > > detect
> > > > > > > > > >> errors
> > > > > > > > > >> >> > from
> > > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > > >> >> > > My understanding is that for EOS, the exception
> > will
> > > be
> > > > > > > detected
> > > > > > > > > by
> > > > > > > > > >> >> > calling
> > > > > > > > > >> >> > > transactional APIs either way, so it is a
> duplicate
> > > > > > handling
> > > > > > > to
> > > > > > > > > >> track
> > > > > > > > > >> >> > > all the sendExceptions in RecordCollector.
> > However, I
> > > > > > looked
> > > > > > > up
> > > > > > > > > >> >> > > sendException is being used today as follow:
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any
> > > default
> > > > > or
> > > > > > > > > >> customized
> > > > > > > > > >> >> > > exception handler to handle
> > > > > > > > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > > > > > > > >> >> > > 3. Check and rethrow exceptions in close(),
> flush()
> > > or
> > > > > new
> > > > > > > send()
> > > > > > > > > >> >> calls
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > To my understanding, we should still honor the
> > > > commitment
> > > > > > to
> > > > > > > #1
> > > > > > > > > for
> > > > > > > > > >> >> any
> > > > > > > > > >> >> > > user customized implementation. The #2 does not
> > > really
> > > > > > > affect our
> > > > > > > > > >> >> > decision
> > > > > > > > > >> >> > > upon EOS. The #3 here is still valuable as it
> could
> > > > help
> > > > > us
> > > > > > > fail
> > > > > > > > > >> fast
> > > > > > > > > >> >> in
> > > > > > > > > >> >> > > new send() instead of waiting to later stage of
> > > > > processing.
> > > > > > > In
> > > > > > > > > that
> > > > > > > > > >> >> > sense,
> > > > > > > > > >> >> > > I agree we should continue to record send
> > exceptions
> > > > even
> > > > > > > under
> > > > > > > > > EOS
> > > > > > > > > >> >> case
> > > > > > > > > >> >> > to
> > > > > > > > > >> >> > > ensure the strength of stream side Producer
> logic.
> > On
> > > > the
> > > > > > > safer
> > > > > > > > > >> side,
> > > > > > > > > >> >> we
> > > > > > > > > >> >> > no
> > > > > > > > > >> >> > > longer need to wrap certain fatal exceptions like
> > > > > > > ProducerFenced
> > > > > > > > > as
> > > > > > > > > >> >> > > TaskMigrated, since they should not crash the
> > stream
> > > > > thread
> > > > > > > if
> > > > > > > > > >> thrown
> > > > > > > > > >> >> in
> > > > > > > > > >> >> > > raw format, once we adopt the new processing
> model
> > in
> > > > the
> > > > > > > send
> > > > > > > > > >> phase.
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > Guozhang
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > > >> >> > > > wrote:
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > > Thanks for everyone's feedback so far. I have
> > > > > polished
> > > > > > > the
> > > > > > > > > KIP
> > > > > > > > > >> >> after
> > > > > > > > > >> >> > > > > offline discussion with some folks working on
> > EOS
> > > > to
> > > > > > > make the
> > > > > > > > > >> >> > exception
> > > > > > > > > >> >> > > > > handling more lightweight. The essential
> change
> > > is
> > > > > that
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > >> not
> > > > > > > > > >> >> > > > > inventing a new intermediate exception type,
> > but
> > > > > > instead
> > > > > > > > > >> >> separating a
> > > > > > > > > >> >> > > > full
> > > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > > > This way for any exception thrown from phase
> 1,
> > > > will
> > > > > be
> > > > > > > an
> > > > > > > > > >> >> indicator
> > > > > > > > > >> >> > in
> > > > > > > > > >> >> > > > > phase 2 whether we should do commit or abort,
> > and
> > > > > from
> > > > > > > now on
> > > > > > > > > >> >> > > > > `commitTransaction` should only throw fatal
> > > > > exceptions,
> > > > > > > > > >> similar to
> > > > > > > > > >> >> > > > > `abortTransaction`, so that any
> KafkaException
> > > > caught
> > > > > > in
> > > > > > > the
> > > > > > > > > >> >> commit
> > > > > > > > > >> >> > > phase
> > > > > > > > > >> >> > > > > will be definitely fatal to crash the app.
> For
> > > more
> > > > > > > advanced
> > > > > > > > > >> users
> > > > > > > > > >> >> > such
> > > > > > > > > >> >> > > > as
> > > > > > > > > >> >> > > > > Streams, we have the ability to further wrap
> > > > selected
> > > > > > > types
> > > > > > > > > of
> > > > > > > > > >> >> fatal
> > > > > > > > > >> >> > > > > exceptions to trigger task migration if
> > > necessary.
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > > > More details in the KIP, feel free to take
> > > another
> > > > > > look,
> > > > > > > > > >> thanks!
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > > >> >> > > > > wrote:
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > > >> >> > > > > >
> > > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno
> Cadonna
> > <
> > > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > > >> >> > > > wrote:
> > > > > > > > > >> >> > > > > >
> > > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> Like Matthias, I do also not know the
> > producer
> > > > > > > internal
> > > > > > > > > well
> > > > > > > > > >> >> > enough
> > > > > > > > > >> >> > > to
> > > > > > > > > >> >> > > > > >> comment on the categorization. However, I
> > > think
> > > > > > > having a
> > > > > > > > > >> super
> > > > > > > > > >> >> > > > exception
> > > > > > > > > >> >> > > > > >> (e.g. RetriableException) that  encodes if
> > an
> > > > > > > exception is
> > > > > > > > > >> >> fatal
> > > > > > > > > >> >> > or
> > > > > > > > > >> >> > > > not
> > > > > > > > > >> >> > > > > >> is cleaner, better understandable and less
> > > > > > > error-prone,
> > > > > > > > > >> because
> > > > > > > > > >> >> > > > ideally
> > > > > > > > > >> >> > > > > >> when you add a new non-fatal exception in
> > > future
> > > > > you
> > > > > > > just
> > > > > > > > > >> need
> > > > > > > > > >> >> to
> > > > > > > > > >> >> > > > think
> > > > > > > > > >> >> > > > > >> about letting it inherit from the super
> > > > exception
> > > > > > and
> > > > > > > all
> > > > > > > > > >> the
> > > > > > > > > >> >> rest
> > > > > > > > > >> >> > > of
> > > > > > > > > >> >> > > > > >> the code will just behave correctly
> without
> > > the
> > > > > need
> > > > > > > to
> > > > > > > > > wrap
> > > > > > > > > >> >> the
> > > > > > > > > >> >> > new
> > > > > > > > > >> >> > > > > >> exception into another exception each time
> > it
> > > is
> > > > > > > thrown
> > > > > > > > > >> (maybe
> > > > > > > > > >> >> it
> > > > > > > > > >> >> > is
> > > > > > > > > >> >> > > > > >> thrown at different location in the code).
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> As far as I understand the following
> > statement
> > > > > from
> > > > > > > your
> > > > > > > > > >> >> previous
> > > > > > > > > >> >> > > > e-mail
> > > > > > > > > >> >> > > > > >> is the reason that currently such a super
> > > > > exception
> > > > > > > is not
> > > > > > > > > >> >> > possible:
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> "Right now we have RetriableException
> type,
> > if
> > > > we
> > > > > > are
> > > > > > > > > going
> > > > > > > > > >> to
> > > > > > > > > >> >> > add a
> > > > > > > > > >> >> > > > > >> `ProducerRetriableException` type, we have
> > to
> > > > put
> > > > > > > this new
> > > > > > > > > >> >> > interface
> > > > > > > > > >> >> > > > as
> > > > > > > > > >> >> > > > > >> the parent of the RetriableException,
> > because
> > > > not
> > > > > > all
> > > > > > > > > thrown
> > > > > > > > > >> >> > > non-fatal
> > > > > > > > > >> >> > > > > >> exceptions are `retriable` in general for
> > > > > producer"
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> In the list of exceptions in your KIP, I
> > found
> > > > > > > non-fatal
> > > > > > > > > >> >> > exceptions
> > > > > > > > > >> >> > > > that
> > > > > > > > > >> >> > > > > >> do not inherit from RetriableException. I
> > > guess
> > > > > > those
> > > > > > > are
> > > > > > > > > >> the
> > > > > > > > > >> >> ones
> > > > > > > > > >> >> > > you
> > > > > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> All of those exceptions are non-fatal and
> do
> > > not
> > > > > > > inherit
> > > > > > > > > >> from
> > > > > > > > > >> >> > > > > >> RetriableException. Is there a reason for
> > > that?
> > > > If
> > > > > > > they
> > > > > > > > > >> >> depended
> > > > > > > > > >> >> > > from
> > > > > > > > > >> >> > > > > >> RetriableException we would be a bit
> closer
> > > to a
> > > > > > super
> > > > > > > > > >> >> exception I
> > > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> The reason is that sender may catch those
> > > > > exceptions
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > >> >> > > > > > ProduceResponse, and it currently does
> > infinite
> > > > > > > > > >> >> > > > > > retries on RetriableException. To make sure
> > we
> > > > > could
> > > > > > > > > trigger
> > > > > > > > > >> the
> > > > > > > > > >> >> > > > > > abortTransaction() by catching non-fatal
> > thrown
> > > > > > > > > >> >> > > > > > exceptions and reinitialize the txn state,
> we
> > > > chose
> > > > > > > not to
> > > > > > > > > >> let
> > > > > > > > > >> >> > those
> > > > > > > > > >> >> > > > > > exceptions inherit RetriableException so
> that
> > > > > > > > > >> >> > > > > > they won't cause infinite retry on sender.
> > > > > > > > > >> >> > > > > >
> > > > > > > > > >> >> > > > > >
> > > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > > >> >> > > > think
> > > > > > > > > >> >> > > > > >> to understand that their fatality depends
> on
> > > the
> > > > > > type
> > > > > > > > > (i.e.
> > > > > > > > > >> >> > > > > >> configuration) of the producer. That makes
> > it
> > > > > > > difficult to
> > > > > > > > > >> >> have a
> > > > > > > > > >> >> > > > super
> > > > > > > > > >> >> > > > > >> exception that encodes the retriability as
> > > > > mentioned
> > > > > > > > > above.
> > > > > > > > > >> >> Would
> > > > > > > > > >> >> > it
> > > > > > > > > >> >> > > > be
> > > > > > > > > >> >> > > > > >> possible to introduce exceptions that
> > inherit
> > > > from
> > > > > > > > > >> >> > > RetriableException
> > > > > > > > > >> >> > > > > >> and that are thrown when those exceptions
> > are
> > > > > caught
> > > > > > > from
> > > > > > > > > >> the
> > > > > > > > > >> >> > > brokers
> > > > > > > > > >> >> > > > > >> and the type of the producer is such that
> > the
> > > > > > > exceptions
> > > > > > > > > are
> > > > > > > > > >> >> > > > retriable?
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> Yea, I think in general the exception type
> > > > mixing
> > > > > is
> > > > > > > > > >> difficult
> > > > > > > > > >> >> to
> > > > > > > > > >> >> > > get
> > > > > > > > > >> >> > > > > > them all right. I have already proposed
> > another
> > > > > > > solution
> > > > > > > > > >> based
> > > > > > > > > >> >> on
> > > > > > > > > >> >> > my
> > > > > > > > > >> >> > > > > > offline discussion with some folks working
> on
> > > EOS
> > > > > > > > > >> >> > > > > > to make the handling more straightforward
> for
> > > end
> > > > > > users
> > > > > > > > > >> without
> > > > > > > > > >> >> the
> > > > > > > > > >> >> > > > need
> > > > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > > > >> >> > > > > >
> > > > > > > > > >> >> > > > > >> Best,
> > > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a
> > > pass
> > > > > over
> > > > > > > the
> > > > > > > > > >> list
> > > > > > > > > >> >> and
> > > > > > > > > >> >> > > here
> > > > > > > > > >> >> > > > > are
> > > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > 0) Although this is not part of the
> public
> > > > API,
> > > > > I
> > > > > > > think
> > > > > > > > > we
> > > > > > > > > >> >> > should
> > > > > > > > > >> >> > > > make
> > > > > > > > > >> >> > > > > >> sure
> > > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e. user
> can
> > > > always
> > > > > > > call
> > > > > > > > > >> >> abortTxn()
> > > > > > > > > >> >> > > > when
> > > > > > > > > >> >> > > > > >> > handling non-fatal errors) are indeed
> > > > supported.
> > > > > > > E.g. if
> > > > > > > > > >> the
> > > > > > > > > >> >> txn
> > > > > > > > > >> >> > > is
> > > > > > > > > >> >> > > > > >> already
> > > > > > > > > >> >> > > > > >> > aborted by the broker side, then users
> can
> > > > still
> > > > > > > call
> > > > > > > > > >> >> abortTxn
> > > > > > > > > >> >> > > which
> > > > > > > > > >> >> > > > > >> would
> > > > > > > > > >> >> > > > > >> > not throw another exception but just be
> > > > treated
> > > > > > as a
> > > > > > > > > >> no-op.
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I
> > > think
> > > > > this
> > > > > > > error
> > > > > > > > > >> can
> > > > > > > > > >> >> > also
> > > > > > > > > >> >> > > be
> > > > > > > > > >> >> > > > > >> > returned but not documented yet. This
> > should
> > > > be
> > > > > a
> > > > > > > > > >> non-fatal
> > > > > > > > > >> >> > error.
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this
> error
> > is
> > > > > > > returned
> > > > > > > > > from
> > > > > > > > > >> >> > broker
> > > > > > > > > >> >> > > > when
> > > > > > > > > >> >> > > > > >> txn
> > > > > > > > > >> >> > > > > >> > state transition failed (e.g. it is
> trying
> > > to
> > > > > > > transit to
> > > > > > > > > >> >> > > > > complete-commit
> > > > > > > > > >> >> > > > > >> > while the current state is not
> > > > prepare-commit).
> > > > > > This
> > > > > > > > > error
> > > > > > > > > >> >> could
> > > > > > > > > >> >> > > > > >> indicates
> > > > > > > > > >> >> > > > > >> > a bug on the client internal code or the
> > > > broker
> > > > > > > code,
> > > > > > > > > OR a
> > > > > > > > > >> >> user
> > > > > > > > > >> >> > > > error
> > > > > > > > > >> >> > > > > >> --- a
> > > > > > > > > >> >> > > > > >> > similar error is
> > > > > ConcurrentTransactionsException,
> > > > > > > i.e.
> > > > > > > > > if
> > > > > > > > > >> >> Kafka
> > > > > > > > > >> >> > is
> > > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > > >> >> > > > > >> > these exceptions would only be returned
> if
> > > > users
> > > > > > > try to
> > > > > > > > > do
> > > > > > > > > >> >> > > something
> > > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a
> > > commitTxn,
> > > > > > etc.
> > > > > > > So
> > > > > > > > > I'm
> > > > > > > > > >> >> > > thinking
> > > > > > > > > >> >> > > > it
> > > > > > > > > >> >> > > > > >> > should be a non-fatal error instead of a
> > > fatal
> > > > > > > error,
> > > > > > > > > >> wdyt?
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates
> > fatal
> > > > > > > > > transactional
> > > > > > > > > >> >> > > sequence
> > > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with
> > the
> > > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > > >> >> > > > > >> that
> > > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess your
> > > proposal
> > > > > is
> > > > > > > that
> > > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be
> > treated
> > > > > > either
> > > > > > > as
> > > > > > > > > >> fatal
> > > > > > > > > >> >> > with
> > > > > > > > > >> >> > > > > >> > transactional producer, or non-fatal
> with
> > > > > > idempotent
> > > > > > > > > >> >> producer,
> > > > > > > > > >> >> > is
> > > > > > > > > >> >> > > > that
> > > > > > > > > >> >> > > > > >> > right? If the producer is only
> configured
> > > with
> > > > > > > > > idempotency
> > > > > > > > > >> >> but
> > > > > > > > > >> >> > not
> > > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > > >> >> > > for
> > > > > > > > > >> >> > > > > >> > non-fatal errors would be confusing
> since
> > > > users
> > > > > > are
> > > > > > > not
> > > > > > > > > >> using
> > > > > > > > > >> >> > > > > >> transactions
> > > > > > > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > > > > > > >> OutOfSequenceException
> > > > > > > > > >> >> > as-is
> > > > > > > > > >> >> > > > > (i.e.
> > > > > > > > > >> >> > > > > >> > not wrapped) no matter how the producer
> is
> > > > > > > configured,
> > > > > > > > > and
> > > > > > > > > >> >> let
> > > > > > > > > >> >> > the
> > > > > > > > > >> >> > > > > >> caller
> > > > > > > > > >> >> > > > > >> > decide how to handle it based on whether
> > it
> > > is
> > > > > > only
> > > > > > > > > >> >> idempotent
> > > > > > > > > >> >> > or
> > > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the
> `send()`
> > > > > > callback /
> > > > > > > > > >> future
> > > > > > > > > >> >> can
> > > > > > > > > >> >> > > also
> > > > > > > > > >> >> > > > > >> throw
> > > > > > > > > >> >> > > > > >> > txn-related exceptions, I think this KIP
> > > > should
> > > > > > also
> > > > > > > > > cover
> > > > > > > > > >> >> this
> > > > > > > > > >> >> > > API
> > > > > > > > > >> >> > > > as
> > > > > > > > > >> >> > > > > >> well?
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above:
> > sometimes
> > > > > those
> > > > > > > > > >> non-fatal
> > > > > > > > > >> >> > errors
> > > > > > > > > >> >> > > > > like
> > > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not
> > due
> > > > to
> > > > > > the
> > > > > > > > > state
> > > > > > > > > >> >> being
> > > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > > >> >> > > > > >> > at the broker side, but maybe users are
> > > doing
> > > > > > > something
> > > > > > > > > >> >> wrong.
> > > > > > > > > >> >> > So
> > > > > > > > > >> >> > > > I'm
> > > > > > > > > >> >> > > > > >> > wondering if we should further
> distinguish
> > > > those
> > > > > > > > > non-fatal
> > > > > > > > > >> >> > errors
> > > > > > > > > >> >> > > > > >> between
> > > > > > > > > >> >> > > > > >> > a) those that are caused by Kafka
> itself,
> > > > e.g. a
> > > > > > > broker
> > > > > > > > > >> timed
> > > > > > > > > >> >> > out
> > > > > > > > > >> >> > > > and
> > > > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn
> request
> > is
> > > > > > > received,
> > > > > > > > > >> and b)
> > > > > > > > > >> >> > the
> > > > > > > > > >> >> > > > > user's
> > > > > > > > > >> >> > > > > >> > API call pattern is incorrect, causing
> the
> > > > > request
> > > > > > > to be
> > > > > > > > > >> >> > rejected
> > > > > > > > > >> >> > > > with
> > > > > > > > > >> >> > > > > >> an
> > > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > > >> >> > > > feels
> > > > > > > > > >> >> > > > > >> to
> > > > > > > > > >> >> > > > > >> > me more like for case a), but not case
> b).
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang
> > Chen <
> > > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your proposal
> > > makes
> > > > > > sense
> > > > > > > as
> > > > > > > > > >> well,
> > > > > > > > > >> >> on
> > > > > > > > > >> >> > > the
> > > > > > > > > >> >> > > > > pro
> > > > > > > > > >> >> > > > > >> side
> > > > > > > > > >> >> > > > > >> >> we could have a universally agreed
> > > exception
> > > > > type
> > > > > > > to be
> > > > > > > > > >> >> caught
> > > > > > > > > >> >> > by
> > > > > > > > > >> >> > > > the
> > > > > > > > > >> >> > > > > >> user,
> > > > > > > > > >> >> > > > > >> >> without having an extra layer on top of
> > the
> > > > > > actual
> > > > > > > > > >> >> exceptions.
> > > > > > > > > >> >> > I
> > > > > > > > > >> >> > > > > could
> > > > > > > > > >> >> > > > > >> see
> > > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be more
> > > > > complex.
> > > > > > > Right
> > > > > > > > > >> now
> > > > > > > > > >> >> we
> > > > > > > > > >> >> > > have
> > > > > > > > > >> >> > > > > >> >> RetriableException type, if we are
> going
> > to
> > > > > add a
> > > > > > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we
> > have
> > > to
> > > > > put
> > > > > > > this
> > > > > > > > > >> new
> > > > > > > > > >> >> > > > interface
> > > > > > > > > >> >> > > > > >> as the
> > > > > > > > > >> >> > > > > >> >> parent of the RetriableException,
> because
> > > not
> > > > > all
> > > > > > > > > thrown
> > > > > > > > > >> >> > > non-fatal
> > > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in general
> for
> > > > > > > producer, for
> > > > > > > > > >> >> example
> > > > > > > > > >> >> > > > > >> >> <
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> >
> > > > > > > > > >> >>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > > >> >> > to
> > > > > > > > > >> >> > > > let
> > > > > > > > > >> >> > > > > >> all
> > > > > > > > > >> >> > > > > >> >> the thrown exceptions implement for
> sure,
> > > but
> > > > > > it's
> > > > > > > a
> > > > > > > > > bit
> > > > > > > > > >> >> > > unorthodox
> > > > > > > > > >> >> > > > > in
> > > > > > > > > >> >> > > > > >> the
> > > > > > > > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > > > > > KafkaException
> > > > > > > > > >> wrapping
> > > > > > > > > >> >> > > another
> > > > > > > > > >> >> > > > > >> >> KafkaException as either fatal or
> > > non-fatal.
> > > > If
> > > > > > we
> > > > > > > use
> > > > > > > > > an
> > > > > > > > > >> >> > > interface
> > > > > > > > > >> >> > > > > to
> > > > > > > > > >> >> > > > > >> >> solve #1, it is also required to
> > implement
> > > > > > another
> > > > > > > > > >> bloated
> > > > > > > > > >> >> > > > exception
> > > > > > > > > >> >> > > > > >> class
> > > > > > > > > >> >> > > > > >> >> which could replace KafkaException
> type,
> > as
> > > > we
> > > > > > > couldn't
> > > > > > > > > >> mark
> > > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation,
> > wrapping
> > > > > means
> > > > > > we
> > > > > > > > > could
> > > > > > > > > >> >> limit
> > > > > > > > > >> >> > > the
> > > > > > > > > >> >> > > > > >> scope
> > > > > > > > > >> >> > > > > >> >> of affection to the producer only,
> which
> > is
> > > > > > > important
> > > > > > > > > >> since
> > > > > > > > > >> >> we
> > > > > > > > > >> >> > > > don't
> > > > > > > > > >> >> > > > > >> want
> > > > > > > > > >> >> > > > > >> >> shared exception types to implement a
> > > > > > > producer-related
> > > > > > > > > >> >> > interface,
> > > > > > > > > >> >> > > > > such
> > > > > > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias
> > J.
> > > > Sax
> > > > > <
> > > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions
> makes a
> > > lot
> > > > > of
> > > > > > > sense.
> > > > > > > > > >> As I
> > > > > > > > > >> >> > > don't
> > > > > > > > > >> >> > > > > know
> > > > > > > > > >> >> > > > > >> >>> the producer internals well enough, I
> > > cannot
> > > > > > > comment
> > > > > > > > > on
> > > > > > > > > >> the
> > > > > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we should
> > > > introduce
> > > > > > an
> > > > > > > > > >> exception
> > > > > > > > > >> >> > > > > interface
> > > > > > > > > >> >> > > > > >> >>> that non-fatal exception would
> implement
> > > > > instead
> > > > > > > of
> > > > > > > > > >> >> creating a
> > > > > > > > > >> >> > > new
> > > > > > > > > >> >> > > > > >> class
> > > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions?
> > What
> > > > > would
> > > > > > > be the
> > > > > > > > > >> >> > pros/cons
> > > > > > > > > >> >> > > > for
> > > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen
> wrote:
> > > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > >> >> > > > > >> >>>> I would like to start a discussion
> > thread
> > > > for
> > > > > > > > > KIP-691:
> > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> >
> > > > > > > > > >> >>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the
> > > exception
> > > > > > > handling
> > > > > > > > > >> logic
> > > > > > > > > >> >> > for
> > > > > > > > > >> >> > > > > >> >>>> transactional Producer users by
> > > classifying
> > > > > > > fatal and
> > > > > > > > > >> >> > non-fatal
> > > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > > >> >> > > > > >> >>>> and throw them correspondingly for
> > easier
> > > > > catch
> > > > > > > and
> > > > > > > > > >> retry.
> > > > > > > > > >> >> > Let
> > > > > > > > > >> >> > > me
> > > > > > > > > >> >> > > > > >> know
> > > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > > >> >> > > > > >> >>>
> > > > > > > > > >> >> > > > > >> >>
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >> >
> > > > > > > > > >> >> > > > > >>
> > > > > > > > > >> >> > > > > >
> > > > > > > > > >> >> > > > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > > > --
> > > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > > >> >> > > >
> > > > > > > > > >> >> > >
> > > > > > > > > >> >> >
> > > > > > > > > >> >> >
> > > > > > > > > >> >> > --
> > > > > > > > > >> >> > -- Guozhang
> > > > > > > > > >> >> >
> > > > > > > > > >> >>
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > --
> > > > > > > > > >> > -- Guozhang
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> --
> > > > > > > > > >> -- Guozhang
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Hey Guozhang, I think TimeoutException would not be covered here as it
potentially has a risk of hitting an illegal state on the broker side when
the previous commit was actually successful. Users should try to increase
their max.block.ms to avoid hitting the timeout as a base suggestion, which
is discussed in the KIP.

As for the umbrella exception, I agree it has values to some extent, but in
terms of adoption difficulty, additional exception types usually make the
error handling more complex than simplified, and we are doing our best to
avoid compatibility issues caused by wrapping previously thrown raw
exceptions that invalidates user error handling. And I'm also not in favor
of our exception hierarchy today where all exceptions are subclasses of
KafkaException, and we already mixed its handling with other specific
exception types that could be caught as superclass KafkaException. That
could be addressed in a separate proposal if we see necessary.

Boyang

On Thu, Jan 28, 2021 at 3:28 PM Guozhang Wang <wa...@gmail.com> wrote:

> Thanks Jason for the suggestion, that looks good to me too.
>
> Regarding the non-fatal exceptions wrapped as CommitFailed, I would like to
> clarify if we would cover all the following cases: 1) timeout, 2) unknown
> pid, 3) invalid pid mapping, 4) concurrent transactions?
>
> BTW I think it still makes sense to use an umbrella exception in case in
> the future we add more non-fatal cases even though today we only have a
> few.
>
>
> Guozhang
>
> On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Thanks Jason, I agree with the proposed solution here, will update the
> KIP.
> >
> > On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Boyang,
> > >
> > > It seems like a reasonable suggestion. I wonder if a flag is sufficient
> > > though. The current `Callback` documentation treats "fatal" errors from
> > the
> > > perspective of the individual message that was sent.
> > >
> > > ```
> > >     *                  Non-Retriable exceptions (fatal, the message
> will
> > > never be sent):
> > > ```
> > >
> > > However, we also have fatal errors from the perspective of the
> > transaction
> > > (e.g. when the producer gets fenced). Perhaps that suggests we need
> > > something richer than a boolean flag:
> > >
> > > At a high level, I think the following cases are possible:
> > >
> > > - message rejected (e.g. message too large, invalid topic)
> > > - delivery failed after retries/delivery timeout (e.g. timeout, crc
> > error,
> > > not enough replicas)
> > > - transaction failed (e.g. producer fenced, invalid transaction state)
> > >
> > > Perhaps instead we can have a type like the following:
> > >
> > > class SendFailure {
> > >   FailureType failureType;
> > >   Exception cause;
> > > }
> > >
> > > enum FailureType {
> > >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > > }
> > >
> > > (Not married to any of these names, just a starting point.)
> > >
> > > Then we add a new `onCompletion` as you've suggested:
> > >
> > > default void onCompletion(RecordMetadata metadata, SendFailure
> failure) {
> > >   this.onCompletion(metadata, failure.cause());
> > > }
> > >
> > > This would give streams and other applications enough information to
> know
> > > whether the message can be retried and whether the transaction can be
> > > aborted.
> > >
> > > What do you think?
> > >
> > > -Jason
> > >
> > >
> > > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Thanks Jason for the thoughts.
> > > >
> > > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > >
> > > > > Hi Boyang,
> > > > >
> > > > > Thanks for the iterations here. I think this is something we should
> > > have
> > > > > done a long time ago. It sounds like there are two API changes
> here:
> > > > >
> > > > > 1. We are introducing the `CommitFailedException` to wrap abortable
> > > > > errors that are raised from `commitTransaction`. This sounds fine
> to
> > > me.
> > > > As
> > > > > far as I know, the only case we might need this is when we add
> > support
> > > to
> > > > > let producers recover from coordinator timeouts. Are there any
> > others?
> > > > >
> > > > > I think the purpose here is to ensure non-fatal exceptions are
> > unified
> > > > under the same
> > > > exception umbrella, to make the proceeding to abort any ongoing
> > > transaction
> > > > much easier.
> > > > I don't think `coordinator timeouts` is the only case to recover
> here,
> > > > since we have other
> > > > non-fatal exceptions such as UnknownPid.
> > > >
> > > > 2. We are wrapping non-fatal errors raised from `send` as
> > > `KafkaException`.
> > > > > The motivation for this is less clear to me and it doesn't look
> like
> > > the
> > > > > example from the KIP depends on it. My concern here is
> compatibility.
> > > > > Currently we have the following documentation for the `Callback`
> API:
> > > > >
> > > > > ```
> > > > >      *                  Non-Retriable exceptions (fatal, the
> message
> > > will
> > > > > never be sent):
> > > > >      *
> > > > >      *                  InvalidTopicException
> > > > >      *                  OffsetMetadataTooLargeException
> > > > >      *                  RecordBatchTooLargeException
> > > > >      *                  RecordTooLargeException
> > > > >      *                  UnknownServerException
> > > > >      *                  UnknownProducerIdException
> > > > >      *                  InvalidProducerEpochException
> > > > >      *
> > > > >      *                  Retriable exceptions (transient, may be
> > covered
> > > > by
> > > > > increasing #.retries):
> > > > >      *
> > > > >      *                  CorruptRecordException
> > > > >      *                  InvalidMetadataException
> > > > >      *                  NotEnoughReplicasAfterAppendException
> > > > >      *                  NotEnoughReplicasException
> > > > >      *                  OffsetOutOfRangeException
> > > > >      *                  TimeoutException
> > > > >      *                  UnknownTopicOrPartitionException
> > > > > ```
> > > > >
> > > > > If we wrap all the retriable exceptions documented here as
> > > > > `KafkaException`, wouldn't that break any error handling that users
> > > might
> > > > > already have? it's gonna introduce a compatibility issue.
> > > > >
> > > > > The original intention was to simplify `send` callback error
> handling
> > > by
> > > > doing exception wrapping, as on Streams level
> > > > we have to prepare an exhausting list of exceptions to catch as
> fatal,
> > > and
> > > > the same lengthy list to catch as
> > > > non-fatal. It would be much easier if we got `hints` from the
> callback.
> > > > However,
> > > > I agree there is a compatibility concern, what about deprecating the
> > > > existing:
> > > >
> > > > void onCompletion(RecordMetadata metadata, Exception exception)
> > > >
> > > > and replace it with:
> > > >
> > > > default void onCompletion(RecordMetadata metadata, Exception
> exception,
> > > > boolean isFatal) {
> > > >   this.onCompletion(metadata, exception);
> > > > }
> > > >
> > > > to make sure new users get the benefit of understanding the fatality
> > > based
> > > > on the info presented by the producer?
> > > >
> > > > Thanks,
> > > > > Jason
> > > > >
> > > > >
> > > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com>
> > wrote:
> > > > >
> > > > > > Why  we are receiving all emails kindly remove us from
> > > > > > dev@kafka.apache.org we don't want to receive emails anymore.
> > > > > >
> > > > > > Thanks
> > > > > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com>
> wrote:
> > > > > > >
> > > > > > >
> > > > > > > Thanks Boyang, yes I think I was confused about the different
> > > > handling
> > > > > of
> > > > > > > two abortTxn calls, and now I get it was not intentional. I
> > think I
> > > > do
> > > > > > not
> > > > > > > have more concerns.
> > > > > > >
> > > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the clarification Guozhang, I got your point that
> we
> > > > want
> > > > > to
> > > > > > > > have a consistent handling of fatal exceptions being thrown
> > from
> > > > the
> > > > > > > > abortTxn. I modified the current template to move the fatal
> > > > exception
> > > > > > > > try-catch outside of the processing loop to make sure we
> could
> > > get
> > > > a
> > > > > > chance
> > > > > > > > to close consumer/producer modules. Let me know what you
> think.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Boyang
> > > > > > > >
> > > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > > reluctanthero104@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > My understanding is that abortTransaction would only throw
> > when
> > > > the
> > > > > > > > > producer is in fatal state. For CommitFailed, the producer
> > > should
> > > > > > still
> > > > > > > > be
> > > > > > > > > in the abortable error state, so that abortTransaction call
> > > would
> > > > > not
> > > > > > > > throw.
> > > > > > > > >
> > > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > > wangguoz@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Or are you going to maintain some internal state such
> that,
> > > the
> > > > > > > > >> `abortTransaction` in the catch block would never throw
> > again?
> > > > > > > > >>
> > > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > > wangguoz@gmail.com>
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >> > Hi Boyang/Jason,
> > > > > > > > >> >
> > > > > > > > >> > I've also thought about this (i.e. using CommitFailed
> for
> > > all
> > > > > > > > >> non-fatal),
> > > > > > > > >> > but what I'm pondering is that, in the catch
> > (CommitFailed)
> > > > > block,
> > > > > > > > what
> > > > > > > > >> > would happen if the `producer.abortTransaction();`
> throws
> > > > again?
> > > > > > > > should
> > > > > > > > >> > that be captured as a fatal and cause the client to
> close
> > > > again.
> > > > > > > > >> >
> > > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > > >> >
> > > > > > > > >> > ...
> > > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > > >> >         // Transaction commit failed with abortable
> error,
> > > > user
> > > > > > could
> > > > > > > > >> reset
> > > > > > > > >> >         // the application state and resume with a new
> > > > > > transaction.
> > > > > > > > The
> > > > > > > > >> > root
> > > > > > > > >> >         // cause was wrapped in the thrown exception.
> > > > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > > > >> >         try {
> > > > > > > > >> >             producer.abortTransaction();
> > > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > > >> >             producer.close();
> > > > > > > > >> >             consumer.close();
> > > > > > > > >> >             throw e;
> > > > > > > > >> >         }
> > > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > > >> >         producer.close();
> > > > > > > > >> >         consumer.close();
> > > > > > > > >> >         throw e;
> > > > > > > > >> >     }
> > > > > > > > >> > ...
> > > > > > > > >> >
> > > > > > > > >> > Guozhang
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > >> >> Hey Guozhang,
> > > > > > > > >> >>
> > > > > > > > >> >> Jason and I were discussing the new API offline and
> > decided
> > > > to
> > > > > > take
> > > > > > > > >> >> another
> > > > > > > > >> >> approach. Firstly, the reason not to invent a new API
> > with
> > > > > > returned
> > > > > > > > >> >> boolean
> > > > > > > > >> >> flag is for compatibility consideration, since old EOS
> > code
> > > > > > would not
> > > > > > > > >> know
> > > > > > > > >> >> that a given transaction commit was failed internally
> as
> > > they
> > > > > > don't
> > > > > > > > >> listen
> > > > > > > > >> >> to the output flag. Our proposed alternative solution
> is
> > to
> > > > let
> > > > > > > > >> >> *commitTransaction
> > > > > > > > >> >> throw CommitFailedException whenever the commit failed
> > with
> > > > > > non-fatal
> > > > > > > > >> >> exception*, so that on the caller side the handling
> logic
> > > > > > becomes:
> > > > > > > > >> >>
> > > > > > > > >> >> try {
> > > > > > > > >> >>         if (shouldCommit) {
> > > > > > > > >> >>             producer.commitTransaction();
> > > > > > > > >> >>         } else {
> > > > > > > > >> >>             resetToLastCommittedPositions(consumer);
> > > > > > > > >> >>             producer.abortTransaction();
> > > > > > > > >> >>         }
> > > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > > >> >>         // Transaction commit failed with abortable
> > error,
> > > > user
> > > > > > could
> > > > > > > > >> >> reset
> > > > > > > > >> >>         // the application state and resume with a new
> > > > > > transaction.
> > > > > > > > The
> > > > > > > > >> >> root
> > > > > > > > >> >>         // cause was wrapped in the thrown exception.
> > > > > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > > > > >> >>         producer.abortTransaction();
> > > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > > >> >>         producer.close();
> > > > > > > > >> >>         consumer.close();
> > > > > > > > >> >>         throw e;
> > > > > > > > >> >>     }
> > > > > > > > >> >>
> > > > > > > > >> >> This approach looks cleaner as all exception types
> other
> > > than
> > > > > > > > >> CommitFailed
> > > > > > > > >> >> will doom to be fatal, which is very easy to adopt for
> > > users.
> > > > > In
> > > > > > the
> > > > > > > > >> >> meantime, we still maintain the commitTxn behavior to
> > throw
> > > > > > instead
> > > > > > > > of
> > > > > > > > >> >> silently failing.
> > > > > > > > >> >>
> > > > > > > > >> >> In addition, we decided to drop the recommendation to
> > > handle
> > > > > > > > >> >> TimeoutException and leave it to the users to make the
> > > call.
> > > > > The
> > > > > > > > >> downside
> > > > > > > > >> >> for blindly calling abortTxn upon timeout is that we
> > could
> > > > > > result in
> > > > > > > > an
> > > > > > > > >> >> illegal state when the commit was already successful on
> > the
> > > > > > broker
> > > > > > > > >> >> side. Without a good guarantee on the outcome,
> > > > overcomplicating
> > > > > > the
> > > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > > >> >>
> > > > > > > > >> >> Let me know your thoughts on the new approach here,
> thank
> > > > you!
> > > > > > > > >> >>
> > > > > > > > >> >> Best,
> > > > > > > > >> >> Boyang
> > > > > > > > >> >>
> > > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > > > > wangguoz@gmail.com>
> > > > > > > > >> >> wrote:
> > > > > > > > >> >>
> > > > > > > > >> >> > Thanks for your clarification on 2)/3), that makes
> > sense.
> > > > > > > > >> >> >
> > > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > > >> >> > wrote:
> > > > > > > > >> >> >
> > > > > > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > > > > > wangguoz@gmail.com>
> > > > > > > > >> >> > wrote:
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > > Hello Boyang,
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Thanks for the updated KIP. I read it again and
> > have
> > > > the
> > > > > > > > >> following
> > > > > > > > >> >> > > > thoughts:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does not
> > > throw
> > > > > any
> > > > > > > > >> >> non-fatal
> > > > > > > > >> >> > > > exception, and instead we rely on the subsequent
> > > > beginTxn
> > > > > > call
> > > > > > > > to
> > > > > > > > >> >> > throw,
> > > > > > > > >> >> > > it
> > > > > > > > >> >> > > > may violate some callers with a pattern that
> > relying
> > > on
> > > > > > > > >> commitTxn to
> > > > > > > > >> >> > > > succeed to make some non-rollback operations. For
> > > > > example:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > beginTxn()
> > > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > > >> >> > > > commitTxn()
> > > > > > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > -------------
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > The issue is that, committing DB is a
> non-rollback
> > > > > > operation,
> > > > > > > > and
> > > > > > > > >> >> users
> > > > > > > > >> >> > > may
> > > > > > > > >> >> > > > just rely on commitTxn to return without error to
> > > make
> > > > > this
> > > > > > > > >> >> > non-rollback
> > > > > > > > >> >> > > > call. Of course we can just claim this pattern is
> > not
> > > > > > > > legitimate
> > > > > > > > >> >> and is
> > > > > > > > >> >> > > not
> > > > > > > > >> >> > > > the right way of doing things, but many users may
> > > > > naturally
> > > > > > > > adopt
> > > > > > > > >> >> this
> > > > > > > > >> >> > > > pattern.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > So maybe we should still let commitTxn also throw
> > > > > non-fatal
> > > > > > > > >> >> exceptions,
> > > > > > > > >> >> > > in
> > > > > > > > >> >> > > > which case we would then call abortTxn again.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > try {
> > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > >> >> > > >    // do something
> > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > >> >> > > > }
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > > >> >> > > >     try {
> > > > > > > > >> >> > > >         commitTxn()
> > > > > > > > >> >> > > >     } catch (...) {        // enumerate all fatal
> > > > > > exceptions
> > > > > > > > >> >> > > >         shutdown()
> > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > >> >> > > >         // non-fatal
> > > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > > >> >> > > >     }
> > > > > > > > >> >> > > > }
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > > >> >> > > > try {
> > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > > >> >> > > >         // only throw fatal
> > > > > > > > >> >> > > >         shutdown()
> > > > > > > > >> >> > > >     }
> > > > > > > > >> >> > > > }
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > ---------------------
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Thank makes me think, the alternative we have
> > > discussed
> > > > > > offline
> > > > > > > > >> may
> > > > > > > > >> >> be
> > > > > > > > >> >> > > > better: let commitTxn() to return a boolean flag.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > * If it returns true, it means the commit
> > succeeded.
> > > > > Users
> > > > > > can
> > > > > > > > >> >> > > comfortably
> > > > > > > > >> >> > > > continue and do any external non-rollback
> > operations
> > > if
> > > > > > they
> > > > > > > > >> like.
> > > > > > > > >> >> > > > * If it returns false, it means the commit failed
> > > with
> > > > > > > > non-fatal
> > > > > > > > >> >> error
> > > > > > > > >> >> > > *AND
> > > > > > > > >> >> > > > the txn has been aborted*. Users do not need to
> > call
> > > > > > abortTxn
> > > > > > > > >> again.
> > > > > > > > >> >> > > > * If it throws, then it means fatal errors. Users
> > > > should
> > > > > > shut
> > > > > > > > >> down
> > > > > > > > >> >> the
> > > > > > > > >> >> > > > client.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > try {
> > > > > > > > >> >> > > >    beginTxn()
> > > > > > > > >> >> > > >    // do something
> > > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > > >> >> > > > }
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > try {
> > > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > > >> >> > > >     } else {
> > > > > > > > >> >> > > >         // reset offsets, rollback operations,
> etc
> > > > > > > > >> >> > > >         abortTxn()
> > > > > > > > >> >> > > >     }
> > > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > > >> >> > > >     // only throw fatal
> > > > > > > > >> >> > > >     shutdown()
> > > > > > > > >> >> > > > }
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > > >> >> > > > else
> > > > > > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > -------------------------
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Of course, if we want to have better visibility
> > into
> > > > what
> > > > > > > > caused
> > > > > > > > >> the
> > > > > > > > >> >> > > commit
> > > > > > > > >> >> > > > to fail and txn to abort. We can let the return
> > type
> > > be
> > > > > an
> > > > > > enum
> > > > > > > > >> >> instead
> > > > > > > > >> >> > > of
> > > > > > > > >> >> > > > a flag. But my main idea is to still let the
> > > commitTxn
> > > > be
> > > > > > the
> > > > > > > > >> final
> > > > > > > > >> >> > point
> > > > > > > > >> >> > > > users can tell whether this txn succeeded or not,
> > > > instead
> > > > > > of
> > > > > > > > >> >> relying on
> > > > > > > > >> >> > > the
> > > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > I agree that adding a boolean flag is indeed
> useful
> > > in
> > > > > this
> > > > > > > > case.
> > > > > > > > >> >> Will
> > > > > > > > >> >> > > update the KIP.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 1. Re: "while maintaining the behavior to throw
> fatal
> > > > > > exception
> > > > > > > > in
> > > > > > > > >> >> raw"
> > > > > > > > >> >> > not
> > > > > > > > >> >> > > > sure what do you mean by "throw" here. Are you
> > > > proposing
> > > > > > the
> > > > > > > > >> >> callback
> > > > > > > > >> >> > > would
> > > > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions when
> > > > triggered
> > > > > > > > without
> > > > > > > > >> >> > > wrapping?
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to make
> > the
> > > > end
> > > > > > > > user's
> > > > > > > > >> >> > > expectation consistent
> > > > > > > > >> >> > > regarding exception handling.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > > 2. I'm not sure I understand the claim regarding
> > the
> > > > > > callback
> > > > > > > > >> that
> > > > > > > > >> >> "In
> > > > > > > > >> >> > > EOS
> > > > > > > > >> >> > > > setup, it is not required to handle the
> exception".
> > > Are
> > > > > you
> > > > > > > > >> >> proposing
> > > > > > > > >> >> > > that,
> > > > > > > > >> >> > > > e.g. in Streams, we do not try to handle any
> > > exceptions
> > > > > if
> > > > > > EOS
> > > > > > > > is
> > > > > > > > >> >> > enabled
> > > > > > > > >> >> > > > in the callback anymore, but just let commitTxn()
> > > > itself
> > > > > to
> > > > > > > > fail
> > > > > > > > >> to
> > > > > > > > >> >> be
> > > > > > > > >> >> > > > notified about the problem? Personally I think in
> > > > Streams
> > > > > > we
> > > > > > > > >> should
> > > > > > > > >> >> > just
> > > > > > > > >> >> > > > make the handling logic of the callback to be
> > > > consistent
> > > > > > > > >> regardless
> > > > > > > > >> >> of
> > > > > > > > >> >> > > the
> > > > > > > > >> >> > > > EOS settings (today we have different logic
> > depending
> > > > on
> > > > > > this
> > > > > > > > >> logic,
> > > > > > > > >> >> > > which
> > > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > My idea originates from the claim on send API:
> > > > > > > > >> >> > > "When used as part of a transaction, it is not
> > > necessary
> > > > to
> > > > > > > > define
> > > > > > > > >> a
> > > > > > > > >> >> > > callback or check the result of the future  in
> order
> > to
> > > > > > detect
> > > > > > > > >> errors
> > > > > > > > >> >> > from
> > > > > > > > >> >> > > <code>send</code>. "
> > > > > > > > >> >> > > My understanding is that for EOS, the exception
> will
> > be
> > > > > > detected
> > > > > > > > by
> > > > > > > > >> >> > calling
> > > > > > > > >> >> > > transactional APIs either way, so it is a duplicate
> > > > > handling
> > > > > > to
> > > > > > > > >> track
> > > > > > > > >> >> > > all the sendExceptions in RecordCollector.
> However, I
> > > > > looked
> > > > > > up
> > > > > > > > >> >> > > sendException is being used today as follow:
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any
> > default
> > > > or
> > > > > > > > >> customized
> > > > > > > > >> >> > > exception handler to handle
> > > > > > > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > > > > > > >> >> > > 3. Check and rethrow exceptions in close(), flush()
> > or
> > > > new
> > > > > > send()
> > > > > > > > >> >> calls
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > To my understanding, we should still honor the
> > > commitment
> > > > > to
> > > > > > #1
> > > > > > > > for
> > > > > > > > >> >> any
> > > > > > > > >> >> > > user customized implementation. The #2 does not
> > really
> > > > > > affect our
> > > > > > > > >> >> > decision
> > > > > > > > >> >> > > upon EOS. The #3 here is still valuable as it could
> > > help
> > > > us
> > > > > > fail
> > > > > > > > >> fast
> > > > > > > > >> >> in
> > > > > > > > >> >> > > new send() instead of waiting to later stage of
> > > > processing.
> > > > > > In
> > > > > > > > that
> > > > > > > > >> >> > sense,
> > > > > > > > >> >> > > I agree we should continue to record send
> exceptions
> > > even
> > > > > > under
> > > > > > > > EOS
> > > > > > > > >> >> case
> > > > > > > > >> >> > to
> > > > > > > > >> >> > > ensure the strength of stream side Producer logic.
> On
> > > the
> > > > > > safer
> > > > > > > > >> side,
> > > > > > > > >> >> we
> > > > > > > > >> >> > no
> > > > > > > > >> >> > > longer need to wrap certain fatal exceptions like
> > > > > > ProducerFenced
> > > > > > > > as
> > > > > > > > >> >> > > TaskMigrated, since they should not crash the
> stream
> > > > thread
> > > > > > if
> > > > > > > > >> thrown
> > > > > > > > >> >> in
> > > > > > > > >> >> > > raw format, once we adopt the new processing model
> in
> > > the
> > > > > > send
> > > > > > > > >> phase.
> > > > > > > > >> >> > >
> > > > > > > > >> >> > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > Guozhang
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > > >> >> > > > wrote:
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > > Thanks for everyone's feedback so far. I have
> > > > polished
> > > > > > the
> > > > > > > > KIP
> > > > > > > > >> >> after
> > > > > > > > >> >> > > > > offline discussion with some folks working on
> EOS
> > > to
> > > > > > make the
> > > > > > > > >> >> > exception
> > > > > > > > >> >> > > > > handling more lightweight. The essential change
> > is
> > > > that
> > > > > > we
> > > > > > > > are
> > > > > > > > >> not
> > > > > > > > >> >> > > > > inventing a new intermediate exception type,
> but
> > > > > instead
> > > > > > > > >> >> separating a
> > > > > > > > >> >> > > > full
> > > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > > > This way for any exception thrown from phase 1,
> > > will
> > > > be
> > > > > > an
> > > > > > > > >> >> indicator
> > > > > > > > >> >> > in
> > > > > > > > >> >> > > > > phase 2 whether we should do commit or abort,
> and
> > > > from
> > > > > > now on
> > > > > > > > >> >> > > > > `commitTransaction` should only throw fatal
> > > > exceptions,
> > > > > > > > >> similar to
> > > > > > > > >> >> > > > > `abortTransaction`, so that any KafkaException
> > > caught
> > > > > in
> > > > > > the
> > > > > > > > >> >> commit
> > > > > > > > >> >> > > phase
> > > > > > > > >> >> > > > > will be definitely fatal to crash the app. For
> > more
> > > > > > advanced
> > > > > > > > >> users
> > > > > > > > >> >> > such
> > > > > > > > >> >> > > > as
> > > > > > > > >> >> > > > > Streams, we have the ability to further wrap
> > > selected
> > > > > > types
> > > > > > > > of
> > > > > > > > >> >> fatal
> > > > > > > > >> >> > > > > exceptions to trigger task migration if
> > necessary.
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > > > More details in the KIP, feel free to take
> > another
> > > > > look,
> > > > > > > > >> thanks!
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > > >> >> > > > > wrote:
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > > >> >> > > > > >
> > > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna
> <
> > > > > > > > >> >> bruno@confluent.io>
> > > > > > > > >> >> > > > wrote:
> > > > > > > > >> >> > > > > >
> > > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> Like Matthias, I do also not know the
> producer
> > > > > > internal
> > > > > > > > well
> > > > > > > > >> >> > enough
> > > > > > > > >> >> > > to
> > > > > > > > >> >> > > > > >> comment on the categorization. However, I
> > think
> > > > > > having a
> > > > > > > > >> super
> > > > > > > > >> >> > > > exception
> > > > > > > > >> >> > > > > >> (e.g. RetriableException) that  encodes if
> an
> > > > > > exception is
> > > > > > > > >> >> fatal
> > > > > > > > >> >> > or
> > > > > > > > >> >> > > > not
> > > > > > > > >> >> > > > > >> is cleaner, better understandable and less
> > > > > > error-prone,
> > > > > > > > >> because
> > > > > > > > >> >> > > > ideally
> > > > > > > > >> >> > > > > >> when you add a new non-fatal exception in
> > future
> > > > you
> > > > > > just
> > > > > > > > >> need
> > > > > > > > >> >> to
> > > > > > > > >> >> > > > think
> > > > > > > > >> >> > > > > >> about letting it inherit from the super
> > > exception
> > > > > and
> > > > > > all
> > > > > > > > >> the
> > > > > > > > >> >> rest
> > > > > > > > >> >> > > of
> > > > > > > > >> >> > > > > >> the code will just behave correctly without
> > the
> > > > need
> > > > > > to
> > > > > > > > wrap
> > > > > > > > >> >> the
> > > > > > > > >> >> > new
> > > > > > > > >> >> > > > > >> exception into another exception each time
> it
> > is
> > > > > > thrown
> > > > > > > > >> (maybe
> > > > > > > > >> >> it
> > > > > > > > >> >> > is
> > > > > > > > >> >> > > > > >> thrown at different location in the code).
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> As far as I understand the following
> statement
> > > > from
> > > > > > your
> > > > > > > > >> >> previous
> > > > > > > > >> >> > > > e-mail
> > > > > > > > >> >> > > > > >> is the reason that currently such a super
> > > > exception
> > > > > > is not
> > > > > > > > >> >> > possible:
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> "Right now we have RetriableException type,
> if
> > > we
> > > > > are
> > > > > > > > going
> > > > > > > > >> to
> > > > > > > > >> >> > add a
> > > > > > > > >> >> > > > > >> `ProducerRetriableException` type, we have
> to
> > > put
> > > > > > this new
> > > > > > > > >> >> > interface
> > > > > > > > >> >> > > > as
> > > > > > > > >> >> > > > > >> the parent of the RetriableException,
> because
> > > not
> > > > > all
> > > > > > > > thrown
> > > > > > > > >> >> > > non-fatal
> > > > > > > > >> >> > > > > >> exceptions are `retriable` in general for
> > > > producer"
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> In the list of exceptions in your KIP, I
> found
> > > > > > non-fatal
> > > > > > > > >> >> > exceptions
> > > > > > > > >> >> > > > that
> > > > > > > > >> >> > > > > >> do not inherit from RetriableException. I
> > guess
> > > > > those
> > > > > > are
> > > > > > > > >> the
> > > > > > > > >> >> ones
> > > > > > > > >> >> > > you
> > > > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> All of those exceptions are non-fatal and do
> > not
> > > > > > inherit
> > > > > > > > >> from
> > > > > > > > >> >> > > > > >> RetriableException. Is there a reason for
> > that?
> > > If
> > > > > > they
> > > > > > > > >> >> depended
> > > > > > > > >> >> > > from
> > > > > > > > >> >> > > > > >> RetriableException we would be a bit closer
> > to a
> > > > > super
> > > > > > > > >> >> exception I
> > > > > > > > >> >> > > > > >> mention above.
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> The reason is that sender may catch those
> > > > exceptions
> > > > > > in
> > > > > > > > the
> > > > > > > > >> >> > > > > > ProduceResponse, and it currently does
> infinite
> > > > > > > > >> >> > > > > > retries on RetriableException. To make sure
> we
> > > > could
> > > > > > > > trigger
> > > > > > > > >> the
> > > > > > > > >> >> > > > > > abortTransaction() by catching non-fatal
> thrown
> > > > > > > > >> >> > > > > > exceptions and reinitialize the txn state, we
> > > chose
> > > > > > not to
> > > > > > > > >> let
> > > > > > > > >> >> > those
> > > > > > > > >> >> > > > > > exceptions inherit RetriableException so that
> > > > > > > > >> >> > > > > > they won't cause infinite retry on sender.
> > > > > > > > >> >> > > > > >
> > > > > > > > >> >> > > > > >
> > > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > > >> >> > > > think
> > > > > > > > >> >> > > > > >> to understand that their fatality depends on
> > the
> > > > > type
> > > > > > > > (i.e.
> > > > > > > > >> >> > > > > >> configuration) of the producer. That makes
> it
> > > > > > difficult to
> > > > > > > > >> >> have a
> > > > > > > > >> >> > > > super
> > > > > > > > >> >> > > > > >> exception that encodes the retriability as
> > > > mentioned
> > > > > > > > above.
> > > > > > > > >> >> Would
> > > > > > > > >> >> > it
> > > > > > > > >> >> > > > be
> > > > > > > > >> >> > > > > >> possible to introduce exceptions that
> inherit
> > > from
> > > > > > > > >> >> > > RetriableException
> > > > > > > > >> >> > > > > >> and that are thrown when those exceptions
> are
> > > > caught
> > > > > > from
> > > > > > > > >> the
> > > > > > > > >> >> > > brokers
> > > > > > > > >> >> > > > > >> and the type of the producer is such that
> the
> > > > > > exceptions
> > > > > > > > are
> > > > > > > > >> >> > > > retriable?
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> Yea, I think in general the exception type
> > > mixing
> > > > is
> > > > > > > > >> difficult
> > > > > > > > >> >> to
> > > > > > > > >> >> > > get
> > > > > > > > >> >> > > > > > them all right. I have already proposed
> another
> > > > > > solution
> > > > > > > > >> based
> > > > > > > > >> >> on
> > > > > > > > >> >> > my
> > > > > > > > >> >> > > > > > offline discussion with some folks working on
> > EOS
> > > > > > > > >> >> > > > > > to make the handling more straightforward for
> > end
> > > > > users
> > > > > > > > >> without
> > > > > > > > >> >> the
> > > > > > > > >> >> > > > need
> > > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > > >> >> > > > > >
> > > > > > > > >> >> > > > > >> Best,
> > > > > > > > >> >> > > > > >> Bruno
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a
> > pass
> > > > over
> > > > > > the
> > > > > > > > >> list
> > > > > > > > >> >> and
> > > > > > > > >> >> > > here
> > > > > > > > >> >> > > > > are
> > > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > 0) Although this is not part of the public
> > > API,
> > > > I
> > > > > > think
> > > > > > > > we
> > > > > > > > >> >> > should
> > > > > > > > >> >> > > > make
> > > > > > > > >> >> > > > > >> sure
> > > > > > > > >> >> > > > > >> > that the suggested pattern (i.e. user can
> > > always
> > > > > > call
> > > > > > > > >> >> abortTxn()
> > > > > > > > >> >> > > > when
> > > > > > > > >> >> > > > > >> > handling non-fatal errors) are indeed
> > > supported.
> > > > > > E.g. if
> > > > > > > > >> the
> > > > > > > > >> >> txn
> > > > > > > > >> >> > > is
> > > > > > > > >> >> > > > > >> already
> > > > > > > > >> >> > > > > >> > aborted by the broker side, then users can
> > > still
> > > > > > call
> > > > > > > > >> >> abortTxn
> > > > > > > > >> >> > > which
> > > > > > > > >> >> > > > > >> would
> > > > > > > > >> >> > > > > >> > not throw another exception but just be
> > > treated
> > > > > as a
> > > > > > > > >> no-op.
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I
> > think
> > > > this
> > > > > > error
> > > > > > > > >> can
> > > > > > > > >> >> > also
> > > > > > > > >> >> > > be
> > > > > > > > >> >> > > > > >> > returned but not documented yet. This
> should
> > > be
> > > > a
> > > > > > > > >> non-fatal
> > > > > > > > >> >> > error.
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error
> is
> > > > > > returned
> > > > > > > > from
> > > > > > > > >> >> > broker
> > > > > > > > >> >> > > > when
> > > > > > > > >> >> > > > > >> txn
> > > > > > > > >> >> > > > > >> > state transition failed (e.g. it is trying
> > to
> > > > > > transit to
> > > > > > > > >> >> > > > > complete-commit
> > > > > > > > >> >> > > > > >> > while the current state is not
> > > prepare-commit).
> > > > > This
> > > > > > > > error
> > > > > > > > >> >> could
> > > > > > > > >> >> > > > > >> indicates
> > > > > > > > >> >> > > > > >> > a bug on the client internal code or the
> > > broker
> > > > > > code,
> > > > > > > > OR a
> > > > > > > > >> >> user
> > > > > > > > >> >> > > > error
> > > > > > > > >> >> > > > > >> --- a
> > > > > > > > >> >> > > > > >> > similar error is
> > > > ConcurrentTransactionsException,
> > > > > > i.e.
> > > > > > > > if
> > > > > > > > >> >> Kafka
> > > > > > > > >> >> > is
> > > > > > > > >> >> > > > > >> bug-free
> > > > > > > > >> >> > > > > >> > these exceptions would only be returned if
> > > users
> > > > > > try to
> > > > > > > > do
> > > > > > > > >> >> > > something
> > > > > > > > >> >> > > > > >> wrong,
> > > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a
> > commitTxn,
> > > > > etc.
> > > > > > So
> > > > > > > > I'm
> > > > > > > > >> >> > > thinking
> > > > > > > > >> >> > > > it
> > > > > > > > >> >> > > > > >> > should be a non-fatal error instead of a
> > fatal
> > > > > > error,
> > > > > > > > >> wdyt?
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates
> fatal
> > > > > > > > transactional
> > > > > > > > >> >> > > sequence
> > > > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with
> the
> > > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > > >> >> > > > > >> that
> > > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess your
> > proposal
> > > > is
> > > > > > that
> > > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be
> treated
> > > > > either
> > > > > > as
> > > > > > > > >> fatal
> > > > > > > > >> >> > with
> > > > > > > > >> >> > > > > >> > transactional producer, or non-fatal with
> > > > > idempotent
> > > > > > > > >> >> producer,
> > > > > > > > >> >> > is
> > > > > > > > >> >> > > > that
> > > > > > > > >> >> > > > > >> > right? If the producer is only configured
> > with
> > > > > > > > idempotency
> > > > > > > > >> >> but
> > > > > > > > >> >> > not
> > > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > > >> >> > > for
> > > > > > > > >> >> > > > > >> > non-fatal errors would be confusing since
> > > users
> > > > > are
> > > > > > not
> > > > > > > > >> using
> > > > > > > > >> >> > > > > >> transactions
> > > > > > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > > > > > >> OutOfSequenceException
> > > > > > > > >> >> > as-is
> > > > > > > > >> >> > > > > (i.e.
> > > > > > > > >> >> > > > > >> > not wrapped) no matter how the producer is
> > > > > > configured,
> > > > > > > > and
> > > > > > > > >> >> let
> > > > > > > > >> >> > the
> > > > > > > > >> >> > > > > >> caller
> > > > > > > > >> >> > > > > >> > decide how to handle it based on whether
> it
> > is
> > > > > only
> > > > > > > > >> >> idempotent
> > > > > > > > >> >> > or
> > > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()`
> > > > > callback /
> > > > > > > > >> future
> > > > > > > > >> >> can
> > > > > > > > >> >> > > also
> > > > > > > > >> >> > > > > >> throw
> > > > > > > > >> >> > > > > >> > txn-related exceptions, I think this KIP
> > > should
> > > > > also
> > > > > > > > cover
> > > > > > > > >> >> this
> > > > > > > > >> >> > > API
> > > > > > > > >> >> > > > as
> > > > > > > > >> >> > > > > >> well?
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above:
> sometimes
> > > > those
> > > > > > > > >> non-fatal
> > > > > > > > >> >> > errors
> > > > > > > > >> >> > > > > like
> > > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not
> due
> > > to
> > > > > the
> > > > > > > > state
> > > > > > > > >> >> being
> > > > > > > > >> >> > > > > >> corrupted
> > > > > > > > >> >> > > > > >> > at the broker side, but maybe users are
> > doing
> > > > > > something
> > > > > > > > >> >> wrong.
> > > > > > > > >> >> > So
> > > > > > > > >> >> > > > I'm
> > > > > > > > >> >> > > > > >> > wondering if we should further distinguish
> > > those
> > > > > > > > non-fatal
> > > > > > > > >> >> > errors
> > > > > > > > >> >> > > > > >> between
> > > > > > > > >> >> > > > > >> > a) those that are caused by Kafka itself,
> > > e.g. a
> > > > > > broker
> > > > > > > > >> timed
> > > > > > > > >> >> > out
> > > > > > > > >> >> > > > and
> > > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn request
> is
> > > > > > received,
> > > > > > > > >> and b)
> > > > > > > > >> >> > the
> > > > > > > > >> >> > > > > user's
> > > > > > > > >> >> > > > > >> > API call pattern is incorrect, causing the
> > > > request
> > > > > > to be
> > > > > > > > >> >> > rejected
> > > > > > > > >> >> > > > with
> > > > > > > > >> >> > > > > >> an
> > > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > > >> >> > > > feels
> > > > > > > > >> >> > > > > >> to
> > > > > > > > >> >> > > > > >> > me more like for case a), but not case b).
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang
> Chen <
> > > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > > >> >> > > > > >> > wrote:
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your proposal
> > makes
> > > > > sense
> > > > > > as
> > > > > > > > >> well,
> > > > > > > > >> >> on
> > > > > > > > >> >> > > the
> > > > > > > > >> >> > > > > pro
> > > > > > > > >> >> > > > > >> side
> > > > > > > > >> >> > > > > >> >> we could have a universally agreed
> > exception
> > > > type
> > > > > > to be
> > > > > > > > >> >> caught
> > > > > > > > >> >> > by
> > > > > > > > >> >> > > > the
> > > > > > > > >> >> > > > > >> user,
> > > > > > > > >> >> > > > > >> >> without having an extra layer on top of
> the
> > > > > actual
> > > > > > > > >> >> exceptions.
> > > > > > > > >> >> > I
> > > > > > > > >> >> > > > > could
> > > > > > > > >> >> > > > > >> see
> > > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be more
> > > > complex.
> > > > > > Right
> > > > > > > > >> now
> > > > > > > > >> >> we
> > > > > > > > >> >> > > have
> > > > > > > > >> >> > > > > >> >> RetriableException type, if we are going
> to
> > > > add a
> > > > > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we
> have
> > to
> > > > put
> > > > > > this
> > > > > > > > >> new
> > > > > > > > >> >> > > > interface
> > > > > > > > >> >> > > > > >> as the
> > > > > > > > >> >> > > > > >> >> parent of the RetriableException, because
> > not
> > > > all
> > > > > > > > thrown
> > > > > > > > >> >> > > non-fatal
> > > > > > > > >> >> > > > > >> >> exceptions are `retriable` in general for
> > > > > > producer, for
> > > > > > > > >> >> example
> > > > > > > > >> >> > > > > >> >> <
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > >
> > > > > > > > >> >> >
> > > > > > > > >> >>
> > > > > > > > >>
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > > >> >> > > > > >> >>> .
> > > > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > > > >> >> `ProducerRetriableException`
> > > > > > > > >> >> > to
> > > > > > > > >> >> > > > let
> > > > > > > > >> >> > > > > >> all
> > > > > > > > >> >> > > > > >> >> the thrown exceptions implement for sure,
> > but
> > > > > it's
> > > > > > a
> > > > > > > > bit
> > > > > > > > >> >> > > unorthodox
> > > > > > > > >> >> > > > > in
> > > > > > > > >> >> > > > > >> the
> > > > > > > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > > > > KafkaException
> > > > > > > > >> wrapping
> > > > > > > > >> >> > > another
> > > > > > > > >> >> > > > > >> >> KafkaException as either fatal or
> > non-fatal.
> > > If
> > > > > we
> > > > > > use
> > > > > > > > an
> > > > > > > > >> >> > > interface
> > > > > > > > >> >> > > > > to
> > > > > > > > >> >> > > > > >> >> solve #1, it is also required to
> implement
> > > > > another
> > > > > > > > >> bloated
> > > > > > > > >> >> > > > exception
> > > > > > > > >> >> > > > > >> class
> > > > > > > > >> >> > > > > >> >> which could replace KafkaException type,
> as
> > > we
> > > > > > couldn't
> > > > > > > > >> mark
> > > > > > > > >> >> > > > > >> KafkaException
> > > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation,
> wrapping
> > > > means
> > > > > we
> > > > > > > > could
> > > > > > > > >> >> limit
> > > > > > > > >> >> > > the
> > > > > > > > >> >> > > > > >> scope
> > > > > > > > >> >> > > > > >> >> of affection to the producer only, which
> is
> > > > > > important
> > > > > > > > >> since
> > > > > > > > >> >> we
> > > > > > > > >> >> > > > don't
> > > > > > > > >> >> > > > > >> want
> > > > > > > > >> >> > > > > >> >> shared exception types to implement a
> > > > > > producer-related
> > > > > > > > >> >> > interface,
> > > > > > > > >> >> > > > > such
> > > > > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >> Best,
> > > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias
> J.
> > > Sax
> > > > <
> > > > > > > > >> >> > mjsax@apache.org
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > > >> wrote:
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions makes a
> > lot
> > > > of
> > > > > > sense.
> > > > > > > > >> As I
> > > > > > > > >> >> > > don't
> > > > > > > > >> >> > > > > know
> > > > > > > > >> >> > > > > >> >>> the producer internals well enough, I
> > cannot
> > > > > > comment
> > > > > > > > on
> > > > > > > > >> the
> > > > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>> What I am wondering is, if we should
> > > introduce
> > > > > an
> > > > > > > > >> exception
> > > > > > > > >> >> > > > > interface
> > > > > > > > >> >> > > > > >> >>> that non-fatal exception would implement
> > > > instead
> > > > > > of
> > > > > > > > >> >> creating a
> > > > > > > > >> >> > > new
> > > > > > > > >> >> > > > > >> class
> > > > > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions?
> What
> > > > would
> > > > > > be the
> > > > > > > > >> >> > pros/cons
> > > > > > > > >> >> > > > for
> > > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > >> >> > > > > >> >>>> I would like to start a discussion
> thread
> > > for
> > > > > > > > KIP-691:
> > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > >
> > > > > > > > >> >> >
> > > > > > > > >> >>
> > > > > > > > >>
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the
> > exception
> > > > > > handling
> > > > > > > > >> logic
> > > > > > > > >> >> > for
> > > > > > > > >> >> > > > > >> >>>> transactional Producer users by
> > classifying
> > > > > > fatal and
> > > > > > > > >> >> > non-fatal
> > > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > > >> >> > > > > >> >>>> and throw them correspondingly for
> easier
> > > > catch
> > > > > > and
> > > > > > > > >> retry.
> > > > > > > > >> >> > Let
> > > > > > > > >> >> > > me
> > > > > > > > >> >> > > > > >> know
> > > > > > > > >> >> > > > > >> >>> what
> > > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > > >> >> > > > > >> >>>>
> > > > > > > > >> >> > > > > >> >>>
> > > > > > > > >> >> > > > > >> >>
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >> >
> > > > > > > > >> >> > > > > >>
> > > > > > > > >> >> > > > > >
> > > > > > > > >> >> > > > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > > > --
> > > > > > > > >> >> > > > -- Guozhang
> > > > > > > > >> >> > > >
> > > > > > > > >> >> > >
> > > > > > > > >> >> >
> > > > > > > > >> >> >
> > > > > > > > >> >> > --
> > > > > > > > >> >> > -- Guozhang
> > > > > > > > >> >> >
> > > > > > > > >> >>
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > --
> > > > > > > > >> > -- Guozhang
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> --
> > > > > > > > >> -- Guozhang
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Jason for the suggestion, that looks good to me too.

Regarding the non-fatal exceptions wrapped as CommitFailed, I would like to
clarify if we would cover all the following cases: 1) timeout, 2) unknown
pid, 3) invalid pid mapping, 4) concurrent transactions?

BTW I think it still makes sense to use an umbrella exception in case in
the future we add more non-fatal cases even though today we only have a few.


Guozhang

On Thu, Jan 28, 2021 at 1:08 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks Jason, I agree with the proposed solution here, will update the KIP.
>
> On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Boyang,
> >
> > It seems like a reasonable suggestion. I wonder if a flag is sufficient
> > though. The current `Callback` documentation treats "fatal" errors from
> the
> > perspective of the individual message that was sent.
> >
> > ```
> >     *                  Non-Retriable exceptions (fatal, the message will
> > never be sent):
> > ```
> >
> > However, we also have fatal errors from the perspective of the
> transaction
> > (e.g. when the producer gets fenced). Perhaps that suggests we need
> > something richer than a boolean flag:
> >
> > At a high level, I think the following cases are possible:
> >
> > - message rejected (e.g. message too large, invalid topic)
> > - delivery failed after retries/delivery timeout (e.g. timeout, crc
> error,
> > not enough replicas)
> > - transaction failed (e.g. producer fenced, invalid transaction state)
> >
> > Perhaps instead we can have a type like the following:
> >
> > class SendFailure {
> >   FailureType failureType;
> >   Exception cause;
> > }
> >
> > enum FailureType {
> >   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> > }
> >
> > (Not married to any of these names, just a starting point.)
> >
> > Then we add a new `onCompletion` as you've suggested:
> >
> > default void onCompletion(RecordMetadata metadata, SendFailure failure) {
> >   this.onCompletion(metadata, failure.cause());
> > }
> >
> > This would give streams and other applications enough information to know
> > whether the message can be retried and whether the transaction can be
> > aborted.
> >
> > What do you think?
> >
> > -Jason
> >
> >
> > On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Thanks Jason for the thoughts.
> > >
> > > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi Boyang,
> > > >
> > > > Thanks for the iterations here. I think this is something we should
> > have
> > > > done a long time ago. It sounds like there are two API changes here:
> > > >
> > > > 1. We are introducing the `CommitFailedException` to wrap abortable
> > > > errors that are raised from `commitTransaction`. This sounds fine to
> > me.
> > > As
> > > > far as I know, the only case we might need this is when we add
> support
> > to
> > > > let producers recover from coordinator timeouts. Are there any
> others?
> > > >
> > > > I think the purpose here is to ensure non-fatal exceptions are
> unified
> > > under the same
> > > exception umbrella, to make the proceeding to abort any ongoing
> > transaction
> > > much easier.
> > > I don't think `coordinator timeouts` is the only case to recover here,
> > > since we have other
> > > non-fatal exceptions such as UnknownPid.
> > >
> > > 2. We are wrapping non-fatal errors raised from `send` as
> > `KafkaException`.
> > > > The motivation for this is less clear to me and it doesn't look like
> > the
> > > > example from the KIP depends on it. My concern here is compatibility.
> > > > Currently we have the following documentation for the `Callback` API:
> > > >
> > > > ```
> > > >      *                  Non-Retriable exceptions (fatal, the message
> > will
> > > > never be sent):
> > > >      *
> > > >      *                  InvalidTopicException
> > > >      *                  OffsetMetadataTooLargeException
> > > >      *                  RecordBatchTooLargeException
> > > >      *                  RecordTooLargeException
> > > >      *                  UnknownServerException
> > > >      *                  UnknownProducerIdException
> > > >      *                  InvalidProducerEpochException
> > > >      *
> > > >      *                  Retriable exceptions (transient, may be
> covered
> > > by
> > > > increasing #.retries):
> > > >      *
> > > >      *                  CorruptRecordException
> > > >      *                  InvalidMetadataException
> > > >      *                  NotEnoughReplicasAfterAppendException
> > > >      *                  NotEnoughReplicasException
> > > >      *                  OffsetOutOfRangeException
> > > >      *                  TimeoutException
> > > >      *                  UnknownTopicOrPartitionException
> > > > ```
> > > >
> > > > If we wrap all the retriable exceptions documented here as
> > > > `KafkaException`, wouldn't that break any error handling that users
> > might
> > > > already have? it's gonna introduce a compatibility issue.
> > > >
> > > > The original intention was to simplify `send` callback error handling
> > by
> > > doing exception wrapping, as on Streams level
> > > we have to prepare an exhausting list of exceptions to catch as fatal,
> > and
> > > the same lengthy list to catch as
> > > non-fatal. It would be much easier if we got `hints` from the callback.
> > > However,
> > > I agree there is a compatibility concern, what about deprecating the
> > > existing:
> > >
> > > void onCompletion(RecordMetadata metadata, Exception exception)
> > >
> > > and replace it with:
> > >
> > > default void onCompletion(RecordMetadata metadata, Exception exception,
> > > boolean isFatal) {
> > >   this.onCompletion(metadata, exception);
> > > }
> > >
> > > to make sure new users get the benefit of understanding the fatality
> > based
> > > on the info presented by the producer?
> > >
> > > Thanks,
> > > > Jason
> > > >
> > > >
> > > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com>
> wrote:
> > > >
> > > > > Why  we are receiving all emails kindly remove us from
> > > > > dev@kafka.apache.org we don't want to receive emails anymore.
> > > > >
> > > > > Thanks
> > > > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com> wrote:
> > > > > >
> > > > > >
> > > > > > Thanks Boyang, yes I think I was confused about the different
> > > handling
> > > > of
> > > > > > two abortTxn calls, and now I get it was not intentional. I
> think I
> > > do
> > > > > not
> > > > > > have more concerns.
> > > > > >
> > > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks for the clarification Guozhang, I got your point that we
> > > want
> > > > to
> > > > > > > have a consistent handling of fatal exceptions being thrown
> from
> > > the
> > > > > > > abortTxn. I modified the current template to move the fatal
> > > exception
> > > > > > > try-catch outside of the processing loop to make sure we could
> > get
> > > a
> > > > > chance
> > > > > > > to close consumer/producer modules. Let me know what you think.
> > > > > > >
> > > > > > > Best,
> > > > > > > Boyang
> > > > > > >
> > > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > My understanding is that abortTransaction would only throw
> when
> > > the
> > > > > > > > producer is in fatal state. For CommitFailed, the producer
> > should
> > > > > still
> > > > > > > be
> > > > > > > > in the abortable error state, so that abortTransaction call
> > would
> > > > not
> > > > > > > throw.
> > > > > > > >
> > > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > > wangguoz@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > > >
> > > > > > > >> Or are you going to maintain some internal state such that,
> > the
> > > > > > > >> `abortTransaction` in the catch block would never throw
> again?
> > > > > > > >>
> > > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > > wangguoz@gmail.com>
> > > > > > > >> wrote:
> > > > > > > >>
> > > > > > > >> > Hi Boyang/Jason,
> > > > > > > >> >
> > > > > > > >> > I've also thought about this (i.e. using CommitFailed for
> > all
> > > > > > > >> non-fatal),
> > > > > > > >> > but what I'm pondering is that, in the catch
> (CommitFailed)
> > > > block,
> > > > > > > what
> > > > > > > >> > would happen if the `producer.abortTransaction();` throws
> > > again?
> > > > > > > should
> > > > > > > >> > that be captured as a fatal and cause the client to close
> > > again.
> > > > > > > >> >
> > > > > > > >> > If yes, then naively the pattern would be:
> > > > > > > >> >
> > > > > > > >> > ...
> > > > > > > >> > catch (CommitFailedException e) {
> > > > > > > >> >         // Transaction commit failed with abortable error,
> > > user
> > > > > could
> > > > > > > >> reset
> > > > > > > >> >         // the application state and resume with a new
> > > > > transaction.
> > > > > > > The
> > > > > > > >> > root
> > > > > > > >> >         // cause was wrapped in the thrown exception.
> > > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > > >> >         try {
> > > > > > > >> >             producer.abortTransaction();
> > > > > > > >> >         } catch (KafkaException e) {
> > > > > > > >> >             producer.close();
> > > > > > > >> >             consumer.close();
> > > > > > > >> >             throw e;
> > > > > > > >> >         }
> > > > > > > >> >     } catch (KafkaException e) {
> > > > > > > >> >         producer.close();
> > > > > > > >> >         consumer.close();
> > > > > > > >> >         throw e;
> > > > > > > >> >     }
> > > > > > > >> > ...
> > > > > > > >> >
> > > > > > > >> > Guozhang
> > > > > > > >> >
> > > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > > >> reluctanthero104@gmail.com>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> >> Hey Guozhang,
> > > > > > > >> >>
> > > > > > > >> >> Jason and I were discussing the new API offline and
> decided
> > > to
> > > > > take
> > > > > > > >> >> another
> > > > > > > >> >> approach. Firstly, the reason not to invent a new API
> with
> > > > > returned
> > > > > > > >> >> boolean
> > > > > > > >> >> flag is for compatibility consideration, since old EOS
> code
> > > > > would not
> > > > > > > >> know
> > > > > > > >> >> that a given transaction commit was failed internally as
> > they
> > > > > don't
> > > > > > > >> listen
> > > > > > > >> >> to the output flag. Our proposed alternative solution is
> to
> > > let
> > > > > > > >> >> *commitTransaction
> > > > > > > >> >> throw CommitFailedException whenever the commit failed
> with
> > > > > non-fatal
> > > > > > > >> >> exception*, so that on the caller side the handling logic
> > > > > becomes:
> > > > > > > >> >>
> > > > > > > >> >> try {
> > > > > > > >> >>         if (shouldCommit) {
> > > > > > > >> >>             producer.commitTransaction();
> > > > > > > >> >>         } else {
> > > > > > > >> >>             resetToLastCommittedPositions(consumer);
> > > > > > > >> >>             producer.abortTransaction();
> > > > > > > >> >>         }
> > > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > > >> >>         // Transaction commit failed with abortable
> error,
> > > user
> > > > > could
> > > > > > > >> >> reset
> > > > > > > >> >>         // the application state and resume with a new
> > > > > transaction.
> > > > > > > The
> > > > > > > >> >> root
> > > > > > > >> >>         // cause was wrapped in the thrown exception.
> > > > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > > > >> >>         producer.abortTransaction();
> > > > > > > >> >>     } catch (KafkaException e) {
> > > > > > > >> >>         producer.close();
> > > > > > > >> >>         consumer.close();
> > > > > > > >> >>         throw e;
> > > > > > > >> >>     }
> > > > > > > >> >>
> > > > > > > >> >> This approach looks cleaner as all exception types other
> > than
> > > > > > > >> CommitFailed
> > > > > > > >> >> will doom to be fatal, which is very easy to adopt for
> > users.
> > > > In
> > > > > the
> > > > > > > >> >> meantime, we still maintain the commitTxn behavior to
> throw
> > > > > instead
> > > > > > > of
> > > > > > > >> >> silently failing.
> > > > > > > >> >>
> > > > > > > >> >> In addition, we decided to drop the recommendation to
> > handle
> > > > > > > >> >> TimeoutException and leave it to the users to make the
> > call.
> > > > The
> > > > > > > >> downside
> > > > > > > >> >> for blindly calling abortTxn upon timeout is that we
> could
> > > > > result in
> > > > > > > an
> > > > > > > >> >> illegal state when the commit was already successful on
> the
> > > > > broker
> > > > > > > >> >> side. Without a good guarantee on the outcome,
> > > overcomplicating
> > > > > the
> > > > > > > >> >> template should not be encouraged IMHO.
> > > > > > > >> >>
> > > > > > > >> >> Let me know your thoughts on the new approach here, thank
> > > you!
> > > > > > > >> >>
> > > > > > > >> >> Best,
> > > > > > > >> >> Boyang
> > > > > > > >> >>
> > > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > > > wangguoz@gmail.com>
> > > > > > > >> >> wrote:
> > > > > > > >> >>
> > > > > > > >> >> > Thanks for your clarification on 2)/3), that makes
> sense.
> > > > > > > >> >> >
> > > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > > >> >> > wrote:
> > > > > > > >> >> >
> > > > > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > > > > >> >> > >
> > > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > > > > wangguoz@gmail.com>
> > > > > > > >> >> > wrote:
> > > > > > > >> >> > >
> > > > > > > >> >> > > > Hello Boyang,
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > Thanks for the updated KIP. I read it again and
> have
> > > the
> > > > > > > >> following
> > > > > > > >> >> > > > thoughts:
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does not
> > throw
> > > > any
> > > > > > > >> >> non-fatal
> > > > > > > >> >> > > > exception, and instead we rely on the subsequent
> > > beginTxn
> > > > > call
> > > > > > > to
> > > > > > > >> >> > throw,
> > > > > > > >> >> > > it
> > > > > > > >> >> > > > may violate some callers with a pattern that
> relying
> > on
> > > > > > > >> commitTxn to
> > > > > > > >> >> > > > succeed to make some non-rollback operations. For
> > > > example:
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > beginTxn()
> > > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > > >> >> > > > commitTxn()
> > > > > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > -------------
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > The issue is that, committing DB is a non-rollback
> > > > > operation,
> > > > > > > and
> > > > > > > >> >> users
> > > > > > > >> >> > > may
> > > > > > > >> >> > > > just rely on commitTxn to return without error to
> > make
> > > > this
> > > > > > > >> >> > non-rollback
> > > > > > > >> >> > > > call. Of course we can just claim this pattern is
> not
> > > > > > > legitimate
> > > > > > > >> >> and is
> > > > > > > >> >> > > not
> > > > > > > >> >> > > > the right way of doing things, but many users may
> > > > naturally
> > > > > > > adopt
> > > > > > > >> >> this
> > > > > > > >> >> > > > pattern.
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > So maybe we should still let commitTxn also throw
> > > > non-fatal
> > > > > > > >> >> exceptions,
> > > > > > > >> >> > > in
> > > > > > > >> >> > > > which case we would then call abortTxn again.
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > try {
> > > > > > > >> >> > > >    beginTxn()
> > > > > > > >> >> > > >    // do something
> > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > >> >> > > > }
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > if (shouldCommit) {
> > > > > > > >> >> > > >     try {
> > > > > > > >> >> > > >         commitTxn()
> > > > > > > >> >> > > >     } catch (...) {        // enumerate all fatal
> > > > > exceptions
> > > > > > > >> >> > > >         shutdown()
> > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > >> >> > > >         // non-fatal
> > > > > > > >> >> > > >         shouldCommit = false;
> > > > > > > >> >> > > >     }
> > > > > > > >> >> > > > }
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > > >> >> > > > try {
> > > > > > > >> >> > > >         abortTxn()
> > > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > > >> >> > > >         // only throw fatal
> > > > > > > >> >> > > >         shutdown()
> > > > > > > >> >> > > >     }
> > > > > > > >> >> > > > }
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > ---------------------
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > Which is much more complicated.
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > Thank makes me think, the alternative we have
> > discussed
> > > > > offline
> > > > > > > >> may
> > > > > > > >> >> be
> > > > > > > >> >> > > > better: let commitTxn() to return a boolean flag.
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > * If it returns true, it means the commit
> succeeded.
> > > > Users
> > > > > can
> > > > > > > >> >> > > comfortably
> > > > > > > >> >> > > > continue and do any external non-rollback
> operations
> > if
> > > > > they
> > > > > > > >> like.
> > > > > > > >> >> > > > * If it returns false, it means the commit failed
> > with
> > > > > > > non-fatal
> > > > > > > >> >> error
> > > > > > > >> >> > > *AND
> > > > > > > >> >> > > > the txn has been aborted*. Users do not need to
> call
> > > > > abortTxn
> > > > > > > >> again.
> > > > > > > >> >> > > > * If it throws, then it means fatal errors. Users
> > > should
> > > > > shut
> > > > > > > >> down
> > > > > > > >> >> the
> > > > > > > >> >> > > > client.
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > try {
> > > > > > > >> >> > > >    beginTxn()
> > > > > > > >> >> > > >    // do something
> > > > > > > >> >> > > > } catch (Exception) {
> > > > > > > >> >> > > >    shouldCommit = false;
> > > > > > > >> >> > > > }
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > try {
> > > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > > >> >> > > >     } else {
> > > > > > > >> >> > > >         // reset offsets, rollback operations, etc
> > > > > > > >> >> > > >         abortTxn()
> > > > > > > >> >> > > >     }
> > > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > > >> >> > > >     // only throw fatal
> > > > > > > >> >> > > >     shutdown()
> > > > > > > >> >> > > > }
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > if (commitSucceeded)
> > > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > > >> >> > > > else
> > > > > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > -------------------------
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > Of course, if we want to have better visibility
> into
> > > what
> > > > > > > caused
> > > > > > > >> the
> > > > > > > >> >> > > commit
> > > > > > > >> >> > > > to fail and txn to abort. We can let the return
> type
> > be
> > > > an
> > > > > enum
> > > > > > > >> >> instead
> > > > > > > >> >> > > of
> > > > > > > >> >> > > > a flag. But my main idea is to still let the
> > commitTxn
> > > be
> > > > > the
> > > > > > > >> final
> > > > > > > >> >> > point
> > > > > > > >> >> > > > users can tell whether this txn succeeded or not,
> > > instead
> > > > > of
> > > > > > > >> >> relying on
> > > > > > > >> >> > > the
> > > > > > > >> >> > > > next beginTxn() call.
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > I agree that adding a boolean flag is indeed useful
> > in
> > > > this
> > > > > > > case.
> > > > > > > >> >> Will
> > > > > > > >> >> > > update the KIP.
> > > > > > > >> >> > >
> > > > > > > >> >> > > 1. Re: "while maintaining the behavior to throw fatal
> > > > > exception
> > > > > > > in
> > > > > > > >> >> raw"
> > > > > > > >> >> > not
> > > > > > > >> >> > > > sure what do you mean by "throw" here. Are you
> > > proposing
> > > > > the
> > > > > > > >> >> callback
> > > > > > > >> >> > > would
> > > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions when
> > > triggered
> > > > > > > without
> > > > > > > >> >> > > wrapping?
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to make
> the
> > > end
> > > > > > > user's
> > > > > > > >> >> > > expectation consistent
> > > > > > > >> >> > > regarding exception handling.
> > > > > > > >> >> > >
> > > > > > > >> >> > >
> > > > > > > >> >> > > > 2. I'm not sure I understand the claim regarding
> the
> > > > > callback
> > > > > > > >> that
> > > > > > > >> >> "In
> > > > > > > >> >> > > EOS
> > > > > > > >> >> > > > setup, it is not required to handle the exception".
> > Are
> > > > you
> > > > > > > >> >> proposing
> > > > > > > >> >> > > that,
> > > > > > > >> >> > > > e.g. in Streams, we do not try to handle any
> > exceptions
> > > > if
> > > > > EOS
> > > > > > > is
> > > > > > > >> >> > enabled
> > > > > > > >> >> > > > in the callback anymore, but just let commitTxn()
> > > itself
> > > > to
> > > > > > > fail
> > > > > > > >> to
> > > > > > > >> >> be
> > > > > > > >> >> > > > notified about the problem? Personally I think in
> > > Streams
> > > > > we
> > > > > > > >> should
> > > > > > > >> >> > just
> > > > > > > >> >> > > > make the handling logic of the callback to be
> > > consistent
> > > > > > > >> regardless
> > > > > > > >> >> of
> > > > > > > >> >> > > the
> > > > > > > >> >> > > > EOS settings (today we have different logic
> depending
> > > on
> > > > > this
> > > > > > > >> logic,
> > > > > > > >> >> > > which
> > > > > > > >> >> > > > I think could be unified as well).
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > My idea originates from the claim on send API:
> > > > > > > >> >> > > "When used as part of a transaction, it is not
> > necessary
> > > to
> > > > > > > define
> > > > > > > >> a
> > > > > > > >> >> > > callback or check the result of the future  in order
> to
> > > > > detect
> > > > > > > >> errors
> > > > > > > >> >> > from
> > > > > > > >> >> > > <code>send</code>. "
> > > > > > > >> >> > > My understanding is that for EOS, the exception will
> be
> > > > > detected
> > > > > > > by
> > > > > > > >> >> > calling
> > > > > > > >> >> > > transactional APIs either way, so it is a duplicate
> > > > handling
> > > > > to
> > > > > > > >> track
> > > > > > > >> >> > > all the sendExceptions in RecordCollector. However, I
> > > > looked
> > > > > up
> > > > > > > >> >> > > sendException is being used today as follow:
> > > > > > > >> >> > >
> > > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any
> default
> > > or
> > > > > > > >> customized
> > > > > > > >> >> > > exception handler to handle
> > > > > > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > > > > > >> >> > > 3. Check and rethrow exceptions in close(), flush()
> or
> > > new
> > > > > send()
> > > > > > > >> >> calls
> > > > > > > >> >> > >
> > > > > > > >> >> > > To my understanding, we should still honor the
> > commitment
> > > > to
> > > > > #1
> > > > > > > for
> > > > > > > >> >> any
> > > > > > > >> >> > > user customized implementation. The #2 does not
> really
> > > > > affect our
> > > > > > > >> >> > decision
> > > > > > > >> >> > > upon EOS. The #3 here is still valuable as it could
> > help
> > > us
> > > > > fail
> > > > > > > >> fast
> > > > > > > >> >> in
> > > > > > > >> >> > > new send() instead of waiting to later stage of
> > > processing.
> > > > > In
> > > > > > > that
> > > > > > > >> >> > sense,
> > > > > > > >> >> > > I agree we should continue to record send exceptions
> > even
> > > > > under
> > > > > > > EOS
> > > > > > > >> >> case
> > > > > > > >> >> > to
> > > > > > > >> >> > > ensure the strength of stream side Producer logic. On
> > the
> > > > > safer
> > > > > > > >> side,
> > > > > > > >> >> we
> > > > > > > >> >> > no
> > > > > > > >> >> > > longer need to wrap certain fatal exceptions like
> > > > > ProducerFenced
> > > > > > > as
> > > > > > > >> >> > > TaskMigrated, since they should not crash the stream
> > > thread
> > > > > if
> > > > > > > >> thrown
> > > > > > > >> >> in
> > > > > > > >> >> > > raw format, once we adopt the new processing model in
> > the
> > > > > send
> > > > > > > >> phase.
> > > > > > > >> >> > >
> > > > > > > >> >> > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > Guozhang
> > > > > > > >> >> > > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > > >> >> > > > wrote:
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > > Thanks for everyone's feedback so far. I have
> > > polished
> > > > > the
> > > > > > > KIP
> > > > > > > >> >> after
> > > > > > > >> >> > > > > offline discussion with some folks working on EOS
> > to
> > > > > make the
> > > > > > > >> >> > exception
> > > > > > > >> >> > > > > handling more lightweight. The essential change
> is
> > > that
> > > > > we
> > > > > > > are
> > > > > > > >> not
> > > > > > > >> >> > > > > inventing a new intermediate exception type, but
> > > > instead
> > > > > > > >> >> separating a
> > > > > > > >> >> > > > full
> > > > > > > >> >> > > > > transaction into two phases:
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > > >> >> > > > > 2. The commit phase
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > > > This way for any exception thrown from phase 1,
> > will
> > > be
> > > > > an
> > > > > > > >> >> indicator
> > > > > > > >> >> > in
> > > > > > > >> >> > > > > phase 2 whether we should do commit or abort, and
> > > from
> > > > > now on
> > > > > > > >> >> > > > > `commitTransaction` should only throw fatal
> > > exceptions,
> > > > > > > >> similar to
> > > > > > > >> >> > > > > `abortTransaction`, so that any KafkaException
> > caught
> > > > in
> > > > > the
> > > > > > > >> >> commit
> > > > > > > >> >> > > phase
> > > > > > > >> >> > > > > will be definitely fatal to crash the app. For
> more
> > > > > advanced
> > > > > > > >> users
> > > > > > > >> >> > such
> > > > > > > >> >> > > > as
> > > > > > > >> >> > > > > Streams, we have the ability to further wrap
> > selected
> > > > > types
> > > > > > > of
> > > > > > > >> >> fatal
> > > > > > > >> >> > > > > exceptions to trigger task migration if
> necessary.
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > > > More details in the KIP, feel free to take
> another
> > > > look,
> > > > > > > >> thanks!
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > > >> >> > > > > wrote:
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > > >> >> > > > > >
> > > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> > > > > > > >> >> bruno@confluent.io>
> > > > > > > >> >> > > > wrote:
> > > > > > > >> >> > > > > >
> > > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> Like Matthias, I do also not know the producer
> > > > > internal
> > > > > > > well
> > > > > > > >> >> > enough
> > > > > > > >> >> > > to
> > > > > > > >> >> > > > > >> comment on the categorization. However, I
> think
> > > > > having a
> > > > > > > >> super
> > > > > > > >> >> > > > exception
> > > > > > > >> >> > > > > >> (e.g. RetriableException) that  encodes if an
> > > > > exception is
> > > > > > > >> >> fatal
> > > > > > > >> >> > or
> > > > > > > >> >> > > > not
> > > > > > > >> >> > > > > >> is cleaner, better understandable and less
> > > > > error-prone,
> > > > > > > >> because
> > > > > > > >> >> > > > ideally
> > > > > > > >> >> > > > > >> when you add a new non-fatal exception in
> future
> > > you
> > > > > just
> > > > > > > >> need
> > > > > > > >> >> to
> > > > > > > >> >> > > > think
> > > > > > > >> >> > > > > >> about letting it inherit from the super
> > exception
> > > > and
> > > > > all
> > > > > > > >> the
> > > > > > > >> >> rest
> > > > > > > >> >> > > of
> > > > > > > >> >> > > > > >> the code will just behave correctly without
> the
> > > need
> > > > > to
> > > > > > > wrap
> > > > > > > >> >> the
> > > > > > > >> >> > new
> > > > > > > >> >> > > > > >> exception into another exception each time it
> is
> > > > > thrown
> > > > > > > >> (maybe
> > > > > > > >> >> it
> > > > > > > >> >> > is
> > > > > > > >> >> > > > > >> thrown at different location in the code).
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> As far as I understand the following statement
> > > from
> > > > > your
> > > > > > > >> >> previous
> > > > > > > >> >> > > > e-mail
> > > > > > > >> >> > > > > >> is the reason that currently such a super
> > > exception
> > > > > is not
> > > > > > > >> >> > possible:
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> "Right now we have RetriableException type, if
> > we
> > > > are
> > > > > > > going
> > > > > > > >> to
> > > > > > > >> >> > add a
> > > > > > > >> >> > > > > >> `ProducerRetriableException` type, we have to
> > put
> > > > > this new
> > > > > > > >> >> > interface
> > > > > > > >> >> > > > as
> > > > > > > >> >> > > > > >> the parent of the RetriableException, because
> > not
> > > > all
> > > > > > > thrown
> > > > > > > >> >> > > non-fatal
> > > > > > > >> >> > > > > >> exceptions are `retriable` in general for
> > > producer"
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> In the list of exceptions in your KIP, I found
> > > > > non-fatal
> > > > > > > >> >> > exceptions
> > > > > > > >> >> > > > that
> > > > > > > >> >> > > > > >> do not inherit from RetriableException. I
> guess
> > > > those
> > > > > are
> > > > > > > >> the
> > > > > > > >> >> ones
> > > > > > > >> >> > > you
> > > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> All of those exceptions are non-fatal and do
> not
> > > > > inherit
> > > > > > > >> from
> > > > > > > >> >> > > > > >> RetriableException. Is there a reason for
> that?
> > If
> > > > > they
> > > > > > > >> >> depended
> > > > > > > >> >> > > from
> > > > > > > >> >> > > > > >> RetriableException we would be a bit closer
> to a
> > > > super
> > > > > > > >> >> exception I
> > > > > > > >> >> > > > > >> mention above.
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> The reason is that sender may catch those
> > > exceptions
> > > > > in
> > > > > > > the
> > > > > > > >> >> > > > > > ProduceResponse, and it currently does infinite
> > > > > > > >> >> > > > > > retries on RetriableException. To make sure we
> > > could
> > > > > > > trigger
> > > > > > > >> the
> > > > > > > >> >> > > > > > abortTransaction() by catching non-fatal thrown
> > > > > > > >> >> > > > > > exceptions and reinitialize the txn state, we
> > chose
> > > > > not to
> > > > > > > >> let
> > > > > > > >> >> > those
> > > > > > > >> >> > > > > > exceptions inherit RetriableException so that
> > > > > > > >> >> > > > > > they won't cause infinite retry on sender.
> > > > > > > >> >> > > > > >
> > > > > > > >> >> > > > > >
> > > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > > >> >> UnknownProducerIdException, I
> > > > > > > >> >> > > > think
> > > > > > > >> >> > > > > >> to understand that their fatality depends on
> the
> > > > type
> > > > > > > (i.e.
> > > > > > > >> >> > > > > >> configuration) of the producer. That makes it
> > > > > difficult to
> > > > > > > >> >> have a
> > > > > > > >> >> > > > super
> > > > > > > >> >> > > > > >> exception that encodes the retriability as
> > > mentioned
> > > > > > > above.
> > > > > > > >> >> Would
> > > > > > > >> >> > it
> > > > > > > >> >> > > > be
> > > > > > > >> >> > > > > >> possible to introduce exceptions that inherit
> > from
> > > > > > > >> >> > > RetriableException
> > > > > > > >> >> > > > > >> and that are thrown when those exceptions are
> > > caught
> > > > > from
> > > > > > > >> the
> > > > > > > >> >> > > brokers
> > > > > > > >> >> > > > > >> and the type of the producer is such that the
> > > > > exceptions
> > > > > > > are
> > > > > > > >> >> > > > retriable?
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> Yea, I think in general the exception type
> > mixing
> > > is
> > > > > > > >> difficult
> > > > > > > >> >> to
> > > > > > > >> >> > > get
> > > > > > > >> >> > > > > > them all right. I have already proposed another
> > > > > solution
> > > > > > > >> based
> > > > > > > >> >> on
> > > > > > > >> >> > my
> > > > > > > >> >> > > > > > offline discussion with some folks working on
> EOS
> > > > > > > >> >> > > > > > to make the handling more straightforward for
> end
> > > > users
> > > > > > > >> without
> > > > > > > >> >> the
> > > > > > > >> >> > > > need
> > > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > > >> >> > > > > >
> > > > > > > >> >> > > > > >> Best,
> > > > > > > >> >> > > > > >> Bruno
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a
> pass
> > > over
> > > > > the
> > > > > > > >> list
> > > > > > > >> >> and
> > > > > > > >> >> > > here
> > > > > > > >> >> > > > > are
> > > > > > > >> >> > > > > >> > some thoughts:
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > 0) Although this is not part of the public
> > API,
> > > I
> > > > > think
> > > > > > > we
> > > > > > > >> >> > should
> > > > > > > >> >> > > > make
> > > > > > > >> >> > > > > >> sure
> > > > > > > >> >> > > > > >> > that the suggested pattern (i.e. user can
> > always
> > > > > call
> > > > > > > >> >> abortTxn()
> > > > > > > >> >> > > > when
> > > > > > > >> >> > > > > >> > handling non-fatal errors) are indeed
> > supported.
> > > > > E.g. if
> > > > > > > >> the
> > > > > > > >> >> txn
> > > > > > > >> >> > > is
> > > > > > > >> >> > > > > >> already
> > > > > > > >> >> > > > > >> > aborted by the broker side, then users can
> > still
> > > > > call
> > > > > > > >> >> abortTxn
> > > > > > > >> >> > > which
> > > > > > > >> >> > > > > >> would
> > > > > > > >> >> > > > > >> > not throw another exception but just be
> > treated
> > > > as a
> > > > > > > >> no-op.
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I
> think
> > > this
> > > > > error
> > > > > > > >> can
> > > > > > > >> >> > also
> > > > > > > >> >> > > be
> > > > > > > >> >> > > > > >> > returned but not documented yet. This should
> > be
> > > a
> > > > > > > >> non-fatal
> > > > > > > >> >> > error.
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is
> > > > > returned
> > > > > > > from
> > > > > > > >> >> > broker
> > > > > > > >> >> > > > when
> > > > > > > >> >> > > > > >> txn
> > > > > > > >> >> > > > > >> > state transition failed (e.g. it is trying
> to
> > > > > transit to
> > > > > > > >> >> > > > > complete-commit
> > > > > > > >> >> > > > > >> > while the current state is not
> > prepare-commit).
> > > > This
> > > > > > > error
> > > > > > > >> >> could
> > > > > > > >> >> > > > > >> indicates
> > > > > > > >> >> > > > > >> > a bug on the client internal code or the
> > broker
> > > > > code,
> > > > > > > OR a
> > > > > > > >> >> user
> > > > > > > >> >> > > > error
> > > > > > > >> >> > > > > >> --- a
> > > > > > > >> >> > > > > >> > similar error is
> > > ConcurrentTransactionsException,
> > > > > i.e.
> > > > > > > if
> > > > > > > >> >> Kafka
> > > > > > > >> >> > is
> > > > > > > >> >> > > > > >> bug-free
> > > > > > > >> >> > > > > >> > these exceptions would only be returned if
> > users
> > > > > try to
> > > > > > > do
> > > > > > > >> >> > > something
> > > > > > > >> >> > > > > >> wrong,
> > > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a
> commitTxn,
> > > > etc.
> > > > > So
> > > > > > > I'm
> > > > > > > >> >> > > thinking
> > > > > > > >> >> > > > it
> > > > > > > >> >> > > > > >> > should be a non-fatal error instead of a
> fatal
> > > > > error,
> > > > > > > >> wdyt?
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> > > > > > > transactional
> > > > > > > >> >> > > sequence
> > > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> > > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > > >> >> > > > > >> that
> > > > > > > >> >> > > > > >> > is treated as non-fatal. I guess your
> proposal
> > > is
> > > > > that
> > > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be treated
> > > > either
> > > > > as
> > > > > > > >> fatal
> > > > > > > >> >> > with
> > > > > > > >> >> > > > > >> > transactional producer, or non-fatal with
> > > > idempotent
> > > > > > > >> >> producer,
> > > > > > > >> >> > is
> > > > > > > >> >> > > > that
> > > > > > > >> >> > > > > >> > right? If the producer is only configured
> with
> > > > > > > idempotency
> > > > > > > >> >> but
> > > > > > > >> >> > not
> > > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > > >> >> TransactionStateCorruptedException
> > > > > > > >> >> > > for
> > > > > > > >> >> > > > > >> > non-fatal errors would be confusing since
> > users
> > > > are
> > > > > not
> > > > > > > >> using
> > > > > > > >> >> > > > > >> transactions
> > > > > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > > > > >> OutOfSequenceException
> > > > > > > >> >> > as-is
> > > > > > > >> >> > > > > (i.e.
> > > > > > > >> >> > > > > >> > not wrapped) no matter how the producer is
> > > > > configured,
> > > > > > > and
> > > > > > > >> >> let
> > > > > > > >> >> > the
> > > > > > > >> >> > > > > >> caller
> > > > > > > >> >> > > > > >> > decide how to handle it based on whether it
> is
> > > > only
> > > > > > > >> >> idempotent
> > > > > > > >> >> > or
> > > > > > > >> >> > > > > >> > transactional itself.
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()`
> > > > callback /
> > > > > > > >> future
> > > > > > > >> >> can
> > > > > > > >> >> > > also
> > > > > > > >> >> > > > > >> throw
> > > > > > > >> >> > > > > >> > txn-related exceptions, I think this KIP
> > should
> > > > also
> > > > > > > cover
> > > > > > > >> >> this
> > > > > > > >> >> > > API
> > > > > > > >> >> > > > as
> > > > > > > >> >> > > > > >> well?
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > 5) This is related to 1/2) above: sometimes
> > > those
> > > > > > > >> non-fatal
> > > > > > > >> >> > errors
> > > > > > > >> >> > > > > like
> > > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due
> > to
> > > > the
> > > > > > > state
> > > > > > > >> >> being
> > > > > > > >> >> > > > > >> corrupted
> > > > > > > >> >> > > > > >> > at the broker side, but maybe users are
> doing
> > > > > something
> > > > > > > >> >> wrong.
> > > > > > > >> >> > So
> > > > > > > >> >> > > > I'm
> > > > > > > >> >> > > > > >> > wondering if we should further distinguish
> > those
> > > > > > > non-fatal
> > > > > > > >> >> > errors
> > > > > > > >> >> > > > > >> between
> > > > > > > >> >> > > > > >> > a) those that are caused by Kafka itself,
> > e.g. a
> > > > > broker
> > > > > > > >> timed
> > > > > > > >> >> > out
> > > > > > > >> >> > > > and
> > > > > > > >> >> > > > > >> > aborted a txn and later an endTxn request is
> > > > > received,
> > > > > > > >> and b)
> > > > > > > >> >> > the
> > > > > > > >> >> > > > > user's
> > > > > > > >> >> > > > > >> > API call pattern is incorrect, causing the
> > > request
> > > > > to be
> > > > > > > >> >> > rejected
> > > > > > > >> >> > > > with
> > > > > > > >> >> > > > > >> an
> > > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > > >> >> > > > feels
> > > > > > > >> >> > > > > >> to
> > > > > > > >> >> > > > > >> > me more like for case a), but not case b).
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > Guozhang
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > > >> >> > > > > >> > wrote:
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> >> Thanks Matthias, I think your proposal
> makes
> > > > sense
> > > > > as
> > > > > > > >> well,
> > > > > > > >> >> on
> > > > > > > >> >> > > the
> > > > > > > >> >> > > > > pro
> > > > > > > >> >> > > > > >> side
> > > > > > > >> >> > > > > >> >> we could have a universally agreed
> exception
> > > type
> > > > > to be
> > > > > > > >> >> caught
> > > > > > > >> >> > by
> > > > > > > >> >> > > > the
> > > > > > > >> >> > > > > >> user,
> > > > > > > >> >> > > > > >> >> without having an extra layer on top of the
> > > > actual
> > > > > > > >> >> exceptions.
> > > > > > > >> >> > I
> > > > > > > >> >> > > > > could
> > > > > > > >> >> > > > > >> see
> > > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be more
> > > complex.
> > > > > Right
> > > > > > > >> now
> > > > > > > >> >> we
> > > > > > > >> >> > > have
> > > > > > > >> >> > > > > >> >> RetriableException type, if we are going to
> > > add a
> > > > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we have
> to
> > > put
> > > > > this
> > > > > > > >> new
> > > > > > > >> >> > > > interface
> > > > > > > >> >> > > > > >> as the
> > > > > > > >> >> > > > > >> >> parent of the RetriableException, because
> not
> > > all
> > > > > > > thrown
> > > > > > > >> >> > > non-fatal
> > > > > > > >> >> > > > > >> >> exceptions are `retriable` in general for
> > > > > producer, for
> > > > > > > >> >> example
> > > > > > > >> >> > > > > >> >> <
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > >
> > > > > > > >> >> >
> > > > > > > >> >>
> > > > > > > >>
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > > >> >> > > > > >> >>> .
> > > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > > >> >> `ProducerRetriableException`
> > > > > > > >> >> > to
> > > > > > > >> >> > > > let
> > > > > > > >> >> > > > > >> all
> > > > > > > >> >> > > > > >> >> the thrown exceptions implement for sure,
> but
> > > > it's
> > > > > a
> > > > > > > bit
> > > > > > > >> >> > > unorthodox
> > > > > > > >> >> > > > > in
> > > > > > > >> >> > > > > >> the
> > > > > > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > > > KafkaException
> > > > > > > >> wrapping
> > > > > > > >> >> > > another
> > > > > > > >> >> > > > > >> >> KafkaException as either fatal or
> non-fatal.
> > If
> > > > we
> > > > > use
> > > > > > > an
> > > > > > > >> >> > > interface
> > > > > > > >> >> > > > > to
> > > > > > > >> >> > > > > >> >> solve #1, it is also required to implement
> > > > another
> > > > > > > >> bloated
> > > > > > > >> >> > > > exception
> > > > > > > >> >> > > > > >> class
> > > > > > > >> >> > > > > >> >> which could replace KafkaException type, as
> > we
> > > > > couldn't
> > > > > > > >> mark
> > > > > > > >> >> > > > > >> KafkaException
> > > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping
> > > means
> > > > we
> > > > > > > could
> > > > > > > >> >> limit
> > > > > > > >> >> > > the
> > > > > > > >> >> > > > > >> scope
> > > > > > > >> >> > > > > >> >> of affection to the producer only, which is
> > > > > important
> > > > > > > >> since
> > > > > > > >> >> we
> > > > > > > >> >> > > > don't
> > > > > > > >> >> > > > > >> want
> > > > > > > >> >> > > > > >> >> shared exception types to implement a
> > > > > producer-related
> > > > > > > >> >> > interface,
> > > > > > > >> >> > > > > such
> > > > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >> Best,
> > > > > > > >> >> > > > > >> >> Boyang
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J.
> > Sax
> > > <
> > > > > > > >> >> > mjsax@apache.org
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > > >> wrote:
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions makes a
> lot
> > > of
> > > > > sense.
> > > > > > > >> As I
> > > > > > > >> >> > > don't
> > > > > > > >> >> > > > > know
> > > > > > > >> >> > > > > >> >>> the producer internals well enough, I
> cannot
> > > > > comment
> > > > > > > on
> > > > > > > >> the
> > > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>> What I am wondering is, if we should
> > introduce
> > > > an
> > > > > > > >> exception
> > > > > > > >> >> > > > > interface
> > > > > > > >> >> > > > > >> >>> that non-fatal exception would implement
> > > instead
> > > > > of
> > > > > > > >> >> creating a
> > > > > > > >> >> > > new
> > > > > > > >> >> > > > > >> class
> > > > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions? What
> > > would
> > > > > be the
> > > > > > > >> >> > pros/cons
> > > > > > > >> >> > > > for
> > > > > > > >> >> > > > > >> >>> both designs?
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > > >> >> > > > > >> >>>>
> > > > > > > >> >> > > > > >> >>>> I would like to start a discussion thread
> > for
> > > > > > > KIP-691:
> > > > > > > >> >> > > > > >> >>>>
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > >
> > > > > > > >> >> >
> > > > > > > >> >>
> > > > > > > >>
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > > >> >> > > > > >> >>>>
> > > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the
> exception
> > > > > handling
> > > > > > > >> logic
> > > > > > > >> >> > for
> > > > > > > >> >> > > > > >> >>>> transactional Producer users by
> classifying
> > > > > fatal and
> > > > > > > >> >> > non-fatal
> > > > > > > >> >> > > > > >> >>> exceptions
> > > > > > > >> >> > > > > >> >>>> and throw them correspondingly for easier
> > > catch
> > > > > and
> > > > > > > >> retry.
> > > > > > > >> >> > Let
> > > > > > > >> >> > > me
> > > > > > > >> >> > > > > >> know
> > > > > > > >> >> > > > > >> >>> what
> > > > > > > >> >> > > > > >> >>>> you think.
> > > > > > > >> >> > > > > >> >>>>
> > > > > > > >> >> > > > > >> >>>> Best,
> > > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > > >> >> > > > > >> >>>>
> > > > > > > >> >> > > > > >> >>>
> > > > > > > >> >> > > > > >> >>
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >> >
> > > > > > > >> >> > > > > >>
> > > > > > > >> >> > > > > >
> > > > > > > >> >> > > > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > > >
> > > > > > > >> >> > > > --
> > > > > > > >> >> > > > -- Guozhang
> > > > > > > >> >> > > >
> > > > > > > >> >> > >
> > > > > > > >> >> >
> > > > > > > >> >> >
> > > > > > > >> >> > --
> > > > > > > >> >> > -- Guozhang
> > > > > > > >> >> >
> > > > > > > >> >>
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > --
> > > > > > > >> > -- Guozhang
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> --
> > > > > > > >> -- Guozhang
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks Jason, I agree with the proposed solution here, will update the KIP.

On Thu, Jan 28, 2021 at 10:52 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> It seems like a reasonable suggestion. I wonder if a flag is sufficient
> though. The current `Callback` documentation treats "fatal" errors from the
> perspective of the individual message that was sent.
>
> ```
>     *                  Non-Retriable exceptions (fatal, the message will
> never be sent):
> ```
>
> However, we also have fatal errors from the perspective of the transaction
> (e.g. when the producer gets fenced). Perhaps that suggests we need
> something richer than a boolean flag:
>
> At a high level, I think the following cases are possible:
>
> - message rejected (e.g. message too large, invalid topic)
> - delivery failed after retries/delivery timeout (e.g. timeout, crc error,
> not enough replicas)
> - transaction failed (e.g. producer fenced, invalid transaction state)
>
> Perhaps instead we can have a type like the following:
>
> class SendFailure {
>   FailureType failureType;
>   Exception cause;
> }
>
> enum FailureType {
>   MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
> }
>
> (Not married to any of these names, just a starting point.)
>
> Then we add a new `onCompletion` as you've suggested:
>
> default void onCompletion(RecordMetadata metadata, SendFailure failure) {
>   this.onCompletion(metadata, failure.cause());
> }
>
> This would give streams and other applications enough information to know
> whether the message can be retried and whether the transaction can be
> aborted.
>
> What do you think?
>
> -Jason
>
>
> On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Thanks Jason for the thoughts.
> >
> > On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi Boyang,
> > >
> > > Thanks for the iterations here. I think this is something we should
> have
> > > done a long time ago. It sounds like there are two API changes here:
> > >
> > > 1. We are introducing the `CommitFailedException` to wrap abortable
> > > errors that are raised from `commitTransaction`. This sounds fine to
> me.
> > As
> > > far as I know, the only case we might need this is when we add support
> to
> > > let producers recover from coordinator timeouts. Are there any others?
> > >
> > > I think the purpose here is to ensure non-fatal exceptions are unified
> > under the same
> > exception umbrella, to make the proceeding to abort any ongoing
> transaction
> > much easier.
> > I don't think `coordinator timeouts` is the only case to recover here,
> > since we have other
> > non-fatal exceptions such as UnknownPid.
> >
> > 2. We are wrapping non-fatal errors raised from `send` as
> `KafkaException`.
> > > The motivation for this is less clear to me and it doesn't look like
> the
> > > example from the KIP depends on it. My concern here is compatibility.
> > > Currently we have the following documentation for the `Callback` API:
> > >
> > > ```
> > >      *                  Non-Retriable exceptions (fatal, the message
> will
> > > never be sent):
> > >      *
> > >      *                  InvalidTopicException
> > >      *                  OffsetMetadataTooLargeException
> > >      *                  RecordBatchTooLargeException
> > >      *                  RecordTooLargeException
> > >      *                  UnknownServerException
> > >      *                  UnknownProducerIdException
> > >      *                  InvalidProducerEpochException
> > >      *
> > >      *                  Retriable exceptions (transient, may be covered
> > by
> > > increasing #.retries):
> > >      *
> > >      *                  CorruptRecordException
> > >      *                  InvalidMetadataException
> > >      *                  NotEnoughReplicasAfterAppendException
> > >      *                  NotEnoughReplicasException
> > >      *                  OffsetOutOfRangeException
> > >      *                  TimeoutException
> > >      *                  UnknownTopicOrPartitionException
> > > ```
> > >
> > > If we wrap all the retriable exceptions documented here as
> > > `KafkaException`, wouldn't that break any error handling that users
> might
> > > already have? it's gonna introduce a compatibility issue.
> > >
> > > The original intention was to simplify `send` callback error handling
> by
> > doing exception wrapping, as on Streams level
> > we have to prepare an exhausting list of exceptions to catch as fatal,
> and
> > the same lengthy list to catch as
> > non-fatal. It would be much easier if we got `hints` from the callback.
> > However,
> > I agree there is a compatibility concern, what about deprecating the
> > existing:
> >
> > void onCompletion(RecordMetadata metadata, Exception exception)
> >
> > and replace it with:
> >
> > default void onCompletion(RecordMetadata metadata, Exception exception,
> > boolean isFatal) {
> >   this.onCompletion(metadata, exception);
> > }
> >
> > to make sure new users get the benefit of understanding the fatality
> based
> > on the info presented by the producer?
> >
> > Thanks,
> > > Jason
> > >
> > >
> > > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com> wrote:
> > >
> > > > Why  we are receiving all emails kindly remove us from
> > > > dev@kafka.apache.org we don't want to receive emails anymore.
> > > >
> > > > Thanks
> > > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com> wrote:
> > > > >
> > > > >
> > > > > Thanks Boyang, yes I think I was confused about the different
> > handling
> > > of
> > > > > two abortTxn calls, and now I get it was not intentional. I think I
> > do
> > > > not
> > > > > have more concerns.
> > > > >
> > > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the clarification Guozhang, I got your point that we
> > want
> > > to
> > > > > > have a consistent handling of fatal exceptions being thrown from
> > the
> > > > > > abortTxn. I modified the current template to move the fatal
> > exception
> > > > > > try-catch outside of the processing loop to make sure we could
> get
> > a
> > > > chance
> > > > > > to close consumer/producer modules. Let me know what you think.
> > > > > >
> > > > > > Best,
> > > > > > Boyang
> > > > > >
> > > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > My understanding is that abortTransaction would only throw when
> > the
> > > > > > > producer is in fatal state. For CommitFailed, the producer
> should
> > > > still
> > > > > > be
> > > > > > > in the abortable error state, so that abortTransaction call
> would
> > > not
> > > > > > throw.
> > > > > > >
> > > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> > wangguoz@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > >> Or are you going to maintain some internal state such that,
> the
> > > > > > >> `abortTransaction` in the catch block would never throw again?
> > > > > > >>
> > > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hi Boyang/Jason,
> > > > > > >> >
> > > > > > >> > I've also thought about this (i.e. using CommitFailed for
> all
> > > > > > >> non-fatal),
> > > > > > >> > but what I'm pondering is that, in the catch (CommitFailed)
> > > block,
> > > > > > what
> > > > > > >> > would happen if the `producer.abortTransaction();` throws
> > again?
> > > > > > should
> > > > > > >> > that be captured as a fatal and cause the client to close
> > again.
> > > > > > >> >
> > > > > > >> > If yes, then naively the pattern would be:
> > > > > > >> >
> > > > > > >> > ...
> > > > > > >> > catch (CommitFailedException e) {
> > > > > > >> >         // Transaction commit failed with abortable error,
> > user
> > > > could
> > > > > > >> reset
> > > > > > >> >         // the application state and resume with a new
> > > > transaction.
> > > > > > The
> > > > > > >> > root
> > > > > > >> >         // cause was wrapped in the thrown exception.
> > > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > > >> >         try {
> > > > > > >> >             producer.abortTransaction();
> > > > > > >> >         } catch (KafkaException e) {
> > > > > > >> >             producer.close();
> > > > > > >> >             consumer.close();
> > > > > > >> >             throw e;
> > > > > > >> >         }
> > > > > > >> >     } catch (KafkaException e) {
> > > > > > >> >         producer.close();
> > > > > > >> >         consumer.close();
> > > > > > >> >         throw e;
> > > > > > >> >     }
> > > > > > >> > ...
> > > > > > >> >
> > > > > > >> > Guozhang
> > > > > > >> >
> > > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > > >> reluctanthero104@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> >> Hey Guozhang,
> > > > > > >> >>
> > > > > > >> >> Jason and I were discussing the new API offline and decided
> > to
> > > > take
> > > > > > >> >> another
> > > > > > >> >> approach. Firstly, the reason not to invent a new API with
> > > > returned
> > > > > > >> >> boolean
> > > > > > >> >> flag is for compatibility consideration, since old EOS code
> > > > would not
> > > > > > >> know
> > > > > > >> >> that a given transaction commit was failed internally as
> they
> > > > don't
> > > > > > >> listen
> > > > > > >> >> to the output flag. Our proposed alternative solution is to
> > let
> > > > > > >> >> *commitTransaction
> > > > > > >> >> throw CommitFailedException whenever the commit failed with
> > > > non-fatal
> > > > > > >> >> exception*, so that on the caller side the handling logic
> > > > becomes:
> > > > > > >> >>
> > > > > > >> >> try {
> > > > > > >> >>         if (shouldCommit) {
> > > > > > >> >>             producer.commitTransaction();
> > > > > > >> >>         } else {
> > > > > > >> >>             resetToLastCommittedPositions(consumer);
> > > > > > >> >>             producer.abortTransaction();
> > > > > > >> >>         }
> > > > > > >> >>     } catch (CommitFailedException e) {
> > > > > > >> >>         // Transaction commit failed with abortable error,
> > user
> > > > could
> > > > > > >> >> reset
> > > > > > >> >>         // the application state and resume with a new
> > > > transaction.
> > > > > > The
> > > > > > >> >> root
> > > > > > >> >>         // cause was wrapped in the thrown exception.
> > > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > > >> >>         producer.abortTransaction();
> > > > > > >> >>     } catch (KafkaException e) {
> > > > > > >> >>         producer.close();
> > > > > > >> >>         consumer.close();
> > > > > > >> >>         throw e;
> > > > > > >> >>     }
> > > > > > >> >>
> > > > > > >> >> This approach looks cleaner as all exception types other
> than
> > > > > > >> CommitFailed
> > > > > > >> >> will doom to be fatal, which is very easy to adopt for
> users.
> > > In
> > > > the
> > > > > > >> >> meantime, we still maintain the commitTxn behavior to throw
> > > > instead
> > > > > > of
> > > > > > >> >> silently failing.
> > > > > > >> >>
> > > > > > >> >> In addition, we decided to drop the recommendation to
> handle
> > > > > > >> >> TimeoutException and leave it to the users to make the
> call.
> > > The
> > > > > > >> downside
> > > > > > >> >> for blindly calling abortTxn upon timeout is that we could
> > > > result in
> > > > > > an
> > > > > > >> >> illegal state when the commit was already successful on the
> > > > broker
> > > > > > >> >> side. Without a good guarantee on the outcome,
> > overcomplicating
> > > > the
> > > > > > >> >> template should not be encouraged IMHO.
> > > > > > >> >>
> > > > > > >> >> Let me know your thoughts on the new approach here, thank
> > you!
> > > > > > >> >>
> > > > > > >> >> Best,
> > > > > > >> >> Boyang
> > > > > > >> >>
> > > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > > wangguoz@gmail.com>
> > > > > > >> >> wrote:
> > > > > > >> >>
> > > > > > >> >> > Thanks for your clarification on 2)/3), that makes sense.
> > > > > > >> >> >
> > > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > > >> >> reluctanthero104@gmail.com>
> > > > > > >> >> > wrote:
> > > > > > >> >> >
> > > > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > > > >> >> > >
> > > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > > > wangguoz@gmail.com>
> > > > > > >> >> > wrote:
> > > > > > >> >> > >
> > > > > > >> >> > > > Hello Boyang,
> > > > > > >> >> > > >
> > > > > > >> >> > > > Thanks for the updated KIP. I read it again and have
> > the
> > > > > > >> following
> > > > > > >> >> > > > thoughts:
> > > > > > >> >> > > >
> > > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does not
> throw
> > > any
> > > > > > >> >> non-fatal
> > > > > > >> >> > > > exception, and instead we rely on the subsequent
> > beginTxn
> > > > call
> > > > > > to
> > > > > > >> >> > throw,
> > > > > > >> >> > > it
> > > > > > >> >> > > > may violate some callers with a pattern that relying
> on
> > > > > > >> commitTxn to
> > > > > > >> >> > > > succeed to make some non-rollback operations. For
> > > example:
> > > > > > >> >> > > >
> > > > > > >> >> > > > beginTxn()
> > > > > > >> >> > > > // do some read-write on my local DB
> > > > > > >> >> > > > commitTxn()
> > > > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > > > >> >> > > >
> > > > > > >> >> > > > -------------
> > > > > > >> >> > > >
> > > > > > >> >> > > > The issue is that, committing DB is a non-rollback
> > > > operation,
> > > > > > and
> > > > > > >> >> users
> > > > > > >> >> > > may
> > > > > > >> >> > > > just rely on commitTxn to return without error to
> make
> > > this
> > > > > > >> >> > non-rollback
> > > > > > >> >> > > > call. Of course we can just claim this pattern is not
> > > > > > legitimate
> > > > > > >> >> and is
> > > > > > >> >> > > not
> > > > > > >> >> > > > the right way of doing things, but many users may
> > > naturally
> > > > > > adopt
> > > > > > >> >> this
> > > > > > >> >> > > > pattern.
> > > > > > >> >> > > >
> > > > > > >> >> > > > So maybe we should still let commitTxn also throw
> > > non-fatal
> > > > > > >> >> exceptions,
> > > > > > >> >> > > in
> > > > > > >> >> > > > which case we would then call abortTxn again.
> > > > > > >> >> > > >
> > > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > > >> >> > > >
> > > > > > >> >> > > > try {
> > > > > > >> >> > > >    beginTxn()
> > > > > > >> >> > > >    // do something
> > > > > > >> >> > > > } catch (Exception) {
> > > > > > >> >> > > >    shouldCommit = false;
> > > > > > >> >> > > > }
> > > > > > >> >> > > >
> > > > > > >> >> > > > if (shouldCommit) {
> > > > > > >> >> > > >     try {
> > > > > > >> >> > > >         commitTxn()
> > > > > > >> >> > > >     } catch (...) {        // enumerate all fatal
> > > > exceptions
> > > > > > >> >> > > >         shutdown()
> > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > >> >> > > >         // non-fatal
> > > > > > >> >> > > >         shouldCommit = false;
> > > > > > >> >> > > >     }
> > > > > > >> >> > > > }
> > > > > > >> >> > > >
> > > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > > >> >> > > > try {
> > > > > > >> >> > > >         abortTxn()
> > > > > > >> >> > > >     } catch (KafkaException) {
> > > > > > >> >> > > >         // only throw fatal
> > > > > > >> >> > > >         shutdown()
> > > > > > >> >> > > >     }
> > > > > > >> >> > > > }
> > > > > > >> >> > > >
> > > > > > >> >> > > > ---------------------
> > > > > > >> >> > > >
> > > > > > >> >> > > > Which is much more complicated.
> > > > > > >> >> > > >
> > > > > > >> >> > > > Thank makes me think, the alternative we have
> discussed
> > > > offline
> > > > > > >> may
> > > > > > >> >> be
> > > > > > >> >> > > > better: let commitTxn() to return a boolean flag.
> > > > > > >> >> > > >
> > > > > > >> >> > > > * If it returns true, it means the commit succeeded.
> > > Users
> > > > can
> > > > > > >> >> > > comfortably
> > > > > > >> >> > > > continue and do any external non-rollback operations
> if
> > > > they
> > > > > > >> like.
> > > > > > >> >> > > > * If it returns false, it means the commit failed
> with
> > > > > > non-fatal
> > > > > > >> >> error
> > > > > > >> >> > > *AND
> > > > > > >> >> > > > the txn has been aborted*. Users do not need to call
> > > > abortTxn
> > > > > > >> again.
> > > > > > >> >> > > > * If it throws, then it means fatal errors. Users
> > should
> > > > shut
> > > > > > >> down
> > > > > > >> >> the
> > > > > > >> >> > > > client.
> > > > > > >> >> > > >
> > > > > > >> >> > > > In this case, the pattern becomes:
> > > > > > >> >> > > >
> > > > > > >> >> > > > try {
> > > > > > >> >> > > >    beginTxn()
> > > > > > >> >> > > >    // do something
> > > > > > >> >> > > > } catch (Exception) {
> > > > > > >> >> > > >    shouldCommit = false;
> > > > > > >> >> > > > }
> > > > > > >> >> > > >
> > > > > > >> >> > > > try {
> > > > > > >> >> > > >     if (shouldCommit) {
> > > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > > >> >> > > >     } else {
> > > > > > >> >> > > >         // reset offsets, rollback operations, etc
> > > > > > >> >> > > >         abortTxn()
> > > > > > >> >> > > >     }
> > > > > > >> >> > > > } catch (KafkaException) {
> > > > > > >> >> > > >     // only throw fatal
> > > > > > >> >> > > >     shutdown()
> > > > > > >> >> > > > }
> > > > > > >> >> > > >
> > > > > > >> >> > > > if (commitSucceeded)
> > > > > > >> >> > > >    // do other non-rollbackable things
> > > > > > >> >> > > > else
> > > > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > > > >> >> > > >
> > > > > > >> >> > > > -------------------------
> > > > > > >> >> > > >
> > > > > > >> >> > > > Of course, if we want to have better visibility into
> > what
> > > > > > caused
> > > > > > >> the
> > > > > > >> >> > > commit
> > > > > > >> >> > > > to fail and txn to abort. We can let the return type
> be
> > > an
> > > > enum
> > > > > > >> >> instead
> > > > > > >> >> > > of
> > > > > > >> >> > > > a flag. But my main idea is to still let the
> commitTxn
> > be
> > > > the
> > > > > > >> final
> > > > > > >> >> > point
> > > > > > >> >> > > > users can tell whether this txn succeeded or not,
> > instead
> > > > of
> > > > > > >> >> relying on
> > > > > > >> >> > > the
> > > > > > >> >> > > > next beginTxn() call.
> > > > > > >> >> > > >
> > > > > > >> >> > > > I agree that adding a boolean flag is indeed useful
> in
> > > this
> > > > > > case.
> > > > > > >> >> Will
> > > > > > >> >> > > update the KIP.
> > > > > > >> >> > >
> > > > > > >> >> > > 1. Re: "while maintaining the behavior to throw fatal
> > > > exception
> > > > > > in
> > > > > > >> >> raw"
> > > > > > >> >> > not
> > > > > > >> >> > > > sure what do you mean by "throw" here. Are you
> > proposing
> > > > the
> > > > > > >> >> callback
> > > > > > >> >> > > would
> > > > > > >> >> > > > *pass* (not throw) in any fatal exceptions when
> > triggered
> > > > > > without
> > > > > > >> >> > > wrapping?
> > > > > > >> >> > > >
> > > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to make the
> > end
> > > > > > user's
> > > > > > >> >> > > expectation consistent
> > > > > > >> >> > > regarding exception handling.
> > > > > > >> >> > >
> > > > > > >> >> > >
> > > > > > >> >> > > > 2. I'm not sure I understand the claim regarding the
> > > > callback
> > > > > > >> that
> > > > > > >> >> "In
> > > > > > >> >> > > EOS
> > > > > > >> >> > > > setup, it is not required to handle the exception".
> Are
> > > you
> > > > > > >> >> proposing
> > > > > > >> >> > > that,
> > > > > > >> >> > > > e.g. in Streams, we do not try to handle any
> exceptions
> > > if
> > > > EOS
> > > > > > is
> > > > > > >> >> > enabled
> > > > > > >> >> > > > in the callback anymore, but just let commitTxn()
> > itself
> > > to
> > > > > > fail
> > > > > > >> to
> > > > > > >> >> be
> > > > > > >> >> > > > notified about the problem? Personally I think in
> > Streams
> > > > we
> > > > > > >> should
> > > > > > >> >> > just
> > > > > > >> >> > > > make the handling logic of the callback to be
> > consistent
> > > > > > >> regardless
> > > > > > >> >> of
> > > > > > >> >> > > the
> > > > > > >> >> > > > EOS settings (today we have different logic depending
> > on
> > > > this
> > > > > > >> logic,
> > > > > > >> >> > > which
> > > > > > >> >> > > > I think could be unified as well).
> > > > > > >> >> > > >
> > > > > > >> >> > > > My idea originates from the claim on send API:
> > > > > > >> >> > > "When used as part of a transaction, it is not
> necessary
> > to
> > > > > > define
> > > > > > >> a
> > > > > > >> >> > > callback or check the result of the future  in order to
> > > > detect
> > > > > > >> errors
> > > > > > >> >> > from
> > > > > > >> >> > > <code>send</code>. "
> > > > > > >> >> > > My understanding is that for EOS, the exception will be
> > > > detected
> > > > > > by
> > > > > > >> >> > calling
> > > > > > >> >> > > transactional APIs either way, so it is a duplicate
> > > handling
> > > > to
> > > > > > >> track
> > > > > > >> >> > > all the sendExceptions in RecordCollector. However, I
> > > looked
> > > > up
> > > > > > >> >> > > sendException is being used today as follow:
> > > > > > >> >> > >
> > > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any default
> > or
> > > > > > >> customized
> > > > > > >> >> > > exception handler to handle
> > > > > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > > > > >> >> > > 3. Check and rethrow exceptions in close(), flush() or
> > new
> > > > send()
> > > > > > >> >> calls
> > > > > > >> >> > >
> > > > > > >> >> > > To my understanding, we should still honor the
> commitment
> > > to
> > > > #1
> > > > > > for
> > > > > > >> >> any
> > > > > > >> >> > > user customized implementation. The #2 does not really
> > > > affect our
> > > > > > >> >> > decision
> > > > > > >> >> > > upon EOS. The #3 here is still valuable as it could
> help
> > us
> > > > fail
> > > > > > >> fast
> > > > > > >> >> in
> > > > > > >> >> > > new send() instead of waiting to later stage of
> > processing.
> > > > In
> > > > > > that
> > > > > > >> >> > sense,
> > > > > > >> >> > > I agree we should continue to record send exceptions
> even
> > > > under
> > > > > > EOS
> > > > > > >> >> case
> > > > > > >> >> > to
> > > > > > >> >> > > ensure the strength of stream side Producer logic. On
> the
> > > > safer
> > > > > > >> side,
> > > > > > >> >> we
> > > > > > >> >> > no
> > > > > > >> >> > > longer need to wrap certain fatal exceptions like
> > > > ProducerFenced
> > > > > > as
> > > > > > >> >> > > TaskMigrated, since they should not crash the stream
> > thread
> > > > if
> > > > > > >> thrown
> > > > > > >> >> in
> > > > > > >> >> > > raw format, once we adopt the new processing model in
> the
> > > > send
> > > > > > >> phase.
> > > > > > >> >> > >
> > > > > > >> >> > >
> > > > > > >> >> > > >
> > > > > > >> >> > > > Guozhang
> > > > > > >> >> > > >
> > > > > > >> >> > > >
> > > > > > >> >> > > >
> > > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > > > >> >> > reluctanthero104@gmail.com>
> > > > > > >> >> > > > wrote:
> > > > > > >> >> > > >
> > > > > > >> >> > > > > Thanks for everyone's feedback so far. I have
> > polished
> > > > the
> > > > > > KIP
> > > > > > >> >> after
> > > > > > >> >> > > > > offline discussion with some folks working on EOS
> to
> > > > make the
> > > > > > >> >> > exception
> > > > > > >> >> > > > > handling more lightweight. The essential change is
> > that
> > > > we
> > > > > > are
> > > > > > >> not
> > > > > > >> >> > > > > inventing a new intermediate exception type, but
> > > instead
> > > > > > >> >> separating a
> > > > > > >> >> > > > full
> > > > > > >> >> > > > > transaction into two phases:
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > 1. The data transmission phase
> > > > > > >> >> > > > > 2. The commit phase
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > This way for any exception thrown from phase 1,
> will
> > be
> > > > an
> > > > > > >> >> indicator
> > > > > > >> >> > in
> > > > > > >> >> > > > > phase 2 whether we should do commit or abort, and
> > from
> > > > now on
> > > > > > >> >> > > > > `commitTransaction` should only throw fatal
> > exceptions,
> > > > > > >> similar to
> > > > > > >> >> > > > > `abortTransaction`, so that any KafkaException
> caught
> > > in
> > > > the
> > > > > > >> >> commit
> > > > > > >> >> > > phase
> > > > > > >> >> > > > > will be definitely fatal to crash the app. For more
> > > > advanced
> > > > > > >> users
> > > > > > >> >> > such
> > > > > > >> >> > > > as
> > > > > > >> >> > > > > Streams, we have the ability to further wrap
> selected
> > > > types
> > > > > > of
> > > > > > >> >> fatal
> > > > > > >> >> > > > > exceptions to trigger task migration if necessary.
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > More details in the KIP, feel free to take another
> > > look,
> > > > > > >> thanks!
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > > >> >> > > > > wrote:
> > > > > > >> >> > > > >
> > > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> > > > > > >> >> bruno@confluent.io>
> > > > > > >> >> > > > wrote:
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> Like Matthias, I do also not know the producer
> > > > internal
> > > > > > well
> > > > > > >> >> > enough
> > > > > > >> >> > > to
> > > > > > >> >> > > > > >> comment on the categorization. However, I think
> > > > having a
> > > > > > >> super
> > > > > > >> >> > > > exception
> > > > > > >> >> > > > > >> (e.g. RetriableException) that  encodes if an
> > > > exception is
> > > > > > >> >> fatal
> > > > > > >> >> > or
> > > > > > >> >> > > > not
> > > > > > >> >> > > > > >> is cleaner, better understandable and less
> > > > error-prone,
> > > > > > >> because
> > > > > > >> >> > > > ideally
> > > > > > >> >> > > > > >> when you add a new non-fatal exception in future
> > you
> > > > just
> > > > > > >> need
> > > > > > >> >> to
> > > > > > >> >> > > > think
> > > > > > >> >> > > > > >> about letting it inherit from the super
> exception
> > > and
> > > > all
> > > > > > >> the
> > > > > > >> >> rest
> > > > > > >> >> > > of
> > > > > > >> >> > > > > >> the code will just behave correctly without the
> > need
> > > > to
> > > > > > wrap
> > > > > > >> >> the
> > > > > > >> >> > new
> > > > > > >> >> > > > > >> exception into another exception each time it is
> > > > thrown
> > > > > > >> (maybe
> > > > > > >> >> it
> > > > > > >> >> > is
> > > > > > >> >> > > > > >> thrown at different location in the code).
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> As far as I understand the following statement
> > from
> > > > your
> > > > > > >> >> previous
> > > > > > >> >> > > > e-mail
> > > > > > >> >> > > > > >> is the reason that currently such a super
> > exception
> > > > is not
> > > > > > >> >> > possible:
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> "Right now we have RetriableException type, if
> we
> > > are
> > > > > > going
> > > > > > >> to
> > > > > > >> >> > add a
> > > > > > >> >> > > > > >> `ProducerRetriableException` type, we have to
> put
> > > > this new
> > > > > > >> >> > interface
> > > > > > >> >> > > > as
> > > > > > >> >> > > > > >> the parent of the RetriableException, because
> not
> > > all
> > > > > > thrown
> > > > > > >> >> > > non-fatal
> > > > > > >> >> > > > > >> exceptions are `retriable` in general for
> > producer"
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> In the list of exceptions in your KIP, I found
> > > > non-fatal
> > > > > > >> >> > exceptions
> > > > > > >> >> > > > that
> > > > > > >> >> > > > > >> do not inherit from RetriableException. I guess
> > > those
> > > > are
> > > > > > >> the
> > > > > > >> >> ones
> > > > > > >> >> > > you
> > > > > > >> >> > > > > >> are referring to in your statement:
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > > >> >> > > > > >> TransactionAbortedException
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> All of those exceptions are non-fatal and do not
> > > > inherit
> > > > > > >> from
> > > > > > >> >> > > > > >> RetriableException. Is there a reason for that?
> If
> > > > they
> > > > > > >> >> depended
> > > > > > >> >> > > from
> > > > > > >> >> > > > > >> RetriableException we would be a bit closer to a
> > > super
> > > > > > >> >> exception I
> > > > > > >> >> > > > > >> mention above.
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> The reason is that sender may catch those
> > exceptions
> > > > in
> > > > > > the
> > > > > > >> >> > > > > > ProduceResponse, and it currently does infinite
> > > > > > >> >> > > > > > retries on RetriableException. To make sure we
> > could
> > > > > > trigger
> > > > > > >> the
> > > > > > >> >> > > > > > abortTransaction() by catching non-fatal thrown
> > > > > > >> >> > > > > > exceptions and reinitialize the txn state, we
> chose
> > > > not to
> > > > > > >> let
> > > > > > >> >> > those
> > > > > > >> >> > > > > > exceptions inherit RetriableException so that
> > > > > > >> >> > > > > > they won't cause infinite retry on sender.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > > >> >> UnknownProducerIdException, I
> > > > > > >> >> > > > think
> > > > > > >> >> > > > > >> to understand that their fatality depends on the
> > > type
> > > > > > (i.e.
> > > > > > >> >> > > > > >> configuration) of the producer. That makes it
> > > > difficult to
> > > > > > >> >> have a
> > > > > > >> >> > > > super
> > > > > > >> >> > > > > >> exception that encodes the retriability as
> > mentioned
> > > > > > above.
> > > > > > >> >> Would
> > > > > > >> >> > it
> > > > > > >> >> > > > be
> > > > > > >> >> > > > > >> possible to introduce exceptions that inherit
> from
> > > > > > >> >> > > RetriableException
> > > > > > >> >> > > > > >> and that are thrown when those exceptions are
> > caught
> > > > from
> > > > > > >> the
> > > > > > >> >> > > brokers
> > > > > > >> >> > > > > >> and the type of the producer is such that the
> > > > exceptions
> > > > > > are
> > > > > > >> >> > > > retriable?
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> Yea, I think in general the exception type
> mixing
> > is
> > > > > > >> difficult
> > > > > > >> >> to
> > > > > > >> >> > > get
> > > > > > >> >> > > > > > them all right. I have already proposed another
> > > > solution
> > > > > > >> based
> > > > > > >> >> on
> > > > > > >> >> > my
> > > > > > >> >> > > > > > offline discussion with some folks working on EOS
> > > > > > >> >> > > > > > to make the handling more straightforward for end
> > > users
> > > > > > >> without
> > > > > > >> >> the
> > > > > > >> >> > > > need
> > > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > > >> Best,
> > > > > > >> >> > > > > >> Bruno
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a pass
> > over
> > > > the
> > > > > > >> list
> > > > > > >> >> and
> > > > > > >> >> > > here
> > > > > > >> >> > > > > are
> > > > > > >> >> > > > > >> > some thoughts:
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > 0) Although this is not part of the public
> API,
> > I
> > > > think
> > > > > > we
> > > > > > >> >> > should
> > > > > > >> >> > > > make
> > > > > > >> >> > > > > >> sure
> > > > > > >> >> > > > > >> > that the suggested pattern (i.e. user can
> always
> > > > call
> > > > > > >> >> abortTxn()
> > > > > > >> >> > > > when
> > > > > > >> >> > > > > >> > handling non-fatal errors) are indeed
> supported.
> > > > E.g. if
> > > > > > >> the
> > > > > > >> >> txn
> > > > > > >> >> > > is
> > > > > > >> >> > > > > >> already
> > > > > > >> >> > > > > >> > aborted by the broker side, then users can
> still
> > > > call
> > > > > > >> >> abortTxn
> > > > > > >> >> > > which
> > > > > > >> >> > > > > >> would
> > > > > > >> >> > > > > >> > not throw another exception but just be
> treated
> > > as a
> > > > > > >> no-op.
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think
> > this
> > > > error
> > > > > > >> can
> > > > > > >> >> > also
> > > > > > >> >> > > be
> > > > > > >> >> > > > > >> > returned but not documented yet. This should
> be
> > a
> > > > > > >> non-fatal
> > > > > > >> >> > error.
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is
> > > > returned
> > > > > > from
> > > > > > >> >> > broker
> > > > > > >> >> > > > when
> > > > > > >> >> > > > > >> txn
> > > > > > >> >> > > > > >> > state transition failed (e.g. it is trying to
> > > > transit to
> > > > > > >> >> > > > > complete-commit
> > > > > > >> >> > > > > >> > while the current state is not
> prepare-commit).
> > > This
> > > > > > error
> > > > > > >> >> could
> > > > > > >> >> > > > > >> indicates
> > > > > > >> >> > > > > >> > a bug on the client internal code or the
> broker
> > > > code,
> > > > > > OR a
> > > > > > >> >> user
> > > > > > >> >> > > > error
> > > > > > >> >> > > > > >> --- a
> > > > > > >> >> > > > > >> > similar error is
> > ConcurrentTransactionsException,
> > > > i.e.
> > > > > > if
> > > > > > >> >> Kafka
> > > > > > >> >> > is
> > > > > > >> >> > > > > >> bug-free
> > > > > > >> >> > > > > >> > these exceptions would only be returned if
> users
> > > > try to
> > > > > > do
> > > > > > >> >> > > something
> > > > > > >> >> > > > > >> wrong,
> > > > > > >> >> > > > > >> > e.g. calling abortTxn right after a commitTxn,
> > > etc.
> > > > So
> > > > > > I'm
> > > > > > >> >> > > thinking
> > > > > > >> >> > > > it
> > > > > > >> >> > > > > >> > should be a non-fatal error instead of a fatal
> > > > error,
> > > > > > >> wdyt?
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> > > > > > transactional
> > > > > > >> >> > > sequence
> > > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> > > > > > >> >> > > > *OutOfSequenceException*
> > > > > > >> >> > > > > >> that
> > > > > > >> >> > > > > >> > is treated as non-fatal. I guess your proposal
> > is
> > > > that
> > > > > > >> >> > > > > >> > OutOfOrderSequenceException would be treated
> > > either
> > > > as
> > > > > > >> fatal
> > > > > > >> >> > with
> > > > > > >> >> > > > > >> > transactional producer, or non-fatal with
> > > idempotent
> > > > > > >> >> producer,
> > > > > > >> >> > is
> > > > > > >> >> > > > that
> > > > > > >> >> > > > > >> > right? If the producer is only configured with
> > > > > > idempotency
> > > > > > >> >> but
> > > > > > >> >> > not
> > > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > > >> >> TransactionStateCorruptedException
> > > > > > >> >> > > for
> > > > > > >> >> > > > > >> > non-fatal errors would be confusing since
> users
> > > are
> > > > not
> > > > > > >> using
> > > > > > >> >> > > > > >> transactions
> > > > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > > > >> OutOfSequenceException
> > > > > > >> >> > as-is
> > > > > > >> >> > > > > (i.e.
> > > > > > >> >> > > > > >> > not wrapped) no matter how the producer is
> > > > configured,
> > > > > > and
> > > > > > >> >> let
> > > > > > >> >> > the
> > > > > > >> >> > > > > >> caller
> > > > > > >> >> > > > > >> > decide how to handle it based on whether it is
> > > only
> > > > > > >> >> idempotent
> > > > > > >> >> > or
> > > > > > >> >> > > > > >> > transactional itself.
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()`
> > > callback /
> > > > > > >> future
> > > > > > >> >> can
> > > > > > >> >> > > also
> > > > > > >> >> > > > > >> throw
> > > > > > >> >> > > > > >> > txn-related exceptions, I think this KIP
> should
> > > also
> > > > > > cover
> > > > > > >> >> this
> > > > > > >> >> > > API
> > > > > > >> >> > > > as
> > > > > > >> >> > > > > >> well?
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > 5) This is related to 1/2) above: sometimes
> > those
> > > > > > >> non-fatal
> > > > > > >> >> > errors
> > > > > > >> >> > > > > like
> > > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due
> to
> > > the
> > > > > > state
> > > > > > >> >> being
> > > > > > >> >> > > > > >> corrupted
> > > > > > >> >> > > > > >> > at the broker side, but maybe users are doing
> > > > something
> > > > > > >> >> wrong.
> > > > > > >> >> > So
> > > > > > >> >> > > > I'm
> > > > > > >> >> > > > > >> > wondering if we should further distinguish
> those
> > > > > > non-fatal
> > > > > > >> >> > errors
> > > > > > >> >> > > > > >> between
> > > > > > >> >> > > > > >> > a) those that are caused by Kafka itself,
> e.g. a
> > > > broker
> > > > > > >> timed
> > > > > > >> >> > out
> > > > > > >> >> > > > and
> > > > > > >> >> > > > > >> > aborted a txn and later an endTxn request is
> > > > received,
> > > > > > >> and b)
> > > > > > >> >> > the
> > > > > > >> >> > > > > user's
> > > > > > >> >> > > > > >> > API call pattern is incorrect, causing the
> > request
> > > > to be
> > > > > > >> >> > rejected
> > > > > > >> >> > > > with
> > > > > > >> >> > > > > >> an
> > > > > > >> >> > > > > >> > error code from the broker.
> > > > > > >> >> *TransactionStateCorruptedException*
> > > > > > >> >> > > > feels
> > > > > > >> >> > > > > >> to
> > > > > > >> >> > > > > >> > me more like for case a), but not case b).
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > Guozhang
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > > >> >> > > > > >> > wrote:
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> >> Thanks Matthias, I think your proposal makes
> > > sense
> > > > as
> > > > > > >> well,
> > > > > > >> >> on
> > > > > > >> >> > > the
> > > > > > >> >> > > > > pro
> > > > > > >> >> > > > > >> side
> > > > > > >> >> > > > > >> >> we could have a universally agreed exception
> > type
> > > > to be
> > > > > > >> >> caught
> > > > > > >> >> > by
> > > > > > >> >> > > > the
> > > > > > >> >> > > > > >> user,
> > > > > > >> >> > > > > >> >> without having an extra layer on top of the
> > > actual
> > > > > > >> >> exceptions.
> > > > > > >> >> > I
> > > > > > >> >> > > > > could
> > > > > > >> >> > > > > >> see
> > > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >> 1. The exception hierarchy will be more
> > complex.
> > > > Right
> > > > > > >> now
> > > > > > >> >> we
> > > > > > >> >> > > have
> > > > > > >> >> > > > > >> >> RetriableException type, if we are going to
> > add a
> > > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we have to
> > put
> > > > this
> > > > > > >> new
> > > > > > >> >> > > > interface
> > > > > > >> >> > > > > >> as the
> > > > > > >> >> > > > > >> >> parent of the RetriableException, because not
> > all
> > > > > > thrown
> > > > > > >> >> > > non-fatal
> > > > > > >> >> > > > > >> >> exceptions are `retriable` in general for
> > > > producer, for
> > > > > > >> >> example
> > > > > > >> >> > > > > >> >> <
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > >
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > > >> >> > > > > >> >>> .
> > > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > > >> >> `ProducerRetriableException`
> > > > > > >> >> > to
> > > > > > >> >> > > > let
> > > > > > >> >> > > > > >> all
> > > > > > >> >> > > > > >> >> the thrown exceptions implement for sure, but
> > > it's
> > > > a
> > > > > > bit
> > > > > > >> >> > > unorthodox
> > > > > > >> >> > > > > in
> > > > > > >> >> > > > > >> the
> > > > > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > > KafkaException
> > > > > > >> wrapping
> > > > > > >> >> > > another
> > > > > > >> >> > > > > >> >> KafkaException as either fatal or non-fatal.
> If
> > > we
> > > > use
> > > > > > an
> > > > > > >> >> > > interface
> > > > > > >> >> > > > > to
> > > > > > >> >> > > > > >> >> solve #1, it is also required to implement
> > > another
> > > > > > >> bloated
> > > > > > >> >> > > > exception
> > > > > > >> >> > > > > >> class
> > > > > > >> >> > > > > >> >> which could replace KafkaException type, as
> we
> > > > couldn't
> > > > > > >> mark
> > > > > > >> >> > > > > >> KafkaException
> > > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping
> > means
> > > we
> > > > > > could
> > > > > > >> >> limit
> > > > > > >> >> > > the
> > > > > > >> >> > > > > >> scope
> > > > > > >> >> > > > > >> >> of affection to the producer only, which is
> > > > important
> > > > > > >> since
> > > > > > >> >> we
> > > > > > >> >> > > > don't
> > > > > > >> >> > > > > >> want
> > > > > > >> >> > > > > >> >> shared exception types to implement a
> > > > producer-related
> > > > > > >> >> > interface,
> > > > > > >> >> > > > > such
> > > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >> Best,
> > > > > > >> >> > > > > >> >> Boyang
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J.
> Sax
> > <
> > > > > > >> >> > mjsax@apache.org
> > > > > > >> >> > > >
> > > > > > >> >> > > > > >> wrote:
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>> Overall, categorizing exceptions makes a lot
> > of
> > > > sense.
> > > > > > >> As I
> > > > > > >> >> > > don't
> > > > > > >> >> > > > > know
> > > > > > >> >> > > > > >> >>> the producer internals well enough, I cannot
> > > > comment
> > > > > > on
> > > > > > >> the
> > > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>> What I am wondering is, if we should
> introduce
> > > an
> > > > > > >> exception
> > > > > > >> >> > > > > interface
> > > > > > >> >> > > > > >> >>> that non-fatal exception would implement
> > instead
> > > > of
> > > > > > >> >> creating a
> > > > > > >> >> > > new
> > > > > > >> >> > > > > >> class
> > > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions? What
> > would
> > > > be the
> > > > > > >> >> > pros/cons
> > > > > > >> >> > > > for
> > > > > > >> >> > > > > >> >>> both designs?
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>> -Matthias
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > > > >> >> > > > > >> >>>> Hey there,
> > > > > > >> >> > > > > >> >>>>
> > > > > > >> >> > > > > >> >>>> I would like to start a discussion thread
> for
> > > > > > KIP-691:
> > > > > > >> >> > > > > >> >>>>
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > >
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >>
> > > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > > >> >> > > > > >> >>>>
> > > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the exception
> > > > handling
> > > > > > >> logic
> > > > > > >> >> > for
> > > > > > >> >> > > > > >> >>>> transactional Producer users by classifying
> > > > fatal and
> > > > > > >> >> > non-fatal
> > > > > > >> >> > > > > >> >>> exceptions
> > > > > > >> >> > > > > >> >>>> and throw them correspondingly for easier
> > catch
> > > > and
> > > > > > >> retry.
> > > > > > >> >> > Let
> > > > > > >> >> > > me
> > > > > > >> >> > > > > >> know
> > > > > > >> >> > > > > >> >>> what
> > > > > > >> >> > > > > >> >>>> you think.
> > > > > > >> >> > > > > >> >>>>
> > > > > > >> >> > > > > >> >>>> Best,
> > > > > > >> >> > > > > >> >>>> Boyang
> > > > > > >> >> > > > > >> >>>>
> > > > > > >> >> > > > > >> >>>
> > > > > > >> >> > > > > >> >>
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >> >
> > > > > > >> >> > > > > >>
> > > > > > >> >> > > > > >
> > > > > > >> >> > > > >
> > > > > > >> >> > > >
> > > > > > >> >> > > >
> > > > > > >> >> > > > --
> > > > > > >> >> > > > -- Guozhang
> > > > > > >> >> > > >
> > > > > > >> >> > >
> > > > > > >> >> >
> > > > > > >> >> >
> > > > > > >> >> > --
> > > > > > >> >> > -- Guozhang
> > > > > > >> >> >
> > > > > > >> >>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > --
> > > > > > >> > -- Guozhang
> > > > > > >> >
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> -- Guozhang
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Boyang,

It seems like a reasonable suggestion. I wonder if a flag is sufficient
though. The current `Callback` documentation treats "fatal" errors from the
perspective of the individual message that was sent.

```
    *                  Non-Retriable exceptions (fatal, the message will
never be sent):
```

However, we also have fatal errors from the perspective of the transaction
(e.g. when the producer gets fenced). Perhaps that suggests we need
something richer than a boolean flag:

At a high level, I think the following cases are possible:

- message rejected (e.g. message too large, invalid topic)
- delivery failed after retries/delivery timeout (e.g. timeout, crc error,
not enough replicas)
- transaction failed (e.g. producer fenced, invalid transaction state)

Perhaps instead we can have a type like the following:

class SendFailure {
  FailureType failureType;
  Exception cause;
}

enum FailureType {
  MESSSAGE_REJECTED, DELIVERY_FAILED, TRANSACTION_FAILED
}

(Not married to any of these names, just a starting point.)

Then we add a new `onCompletion` as you've suggested:

default void onCompletion(RecordMetadata metadata, SendFailure failure) {
  this.onCompletion(metadata, failure.cause());
}

This would give streams and other applications enough information to know
whether the message can be retried and whether the transaction can be
aborted.

What do you think?

-Jason


On Wed, Jan 27, 2021 at 9:51 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks Jason for the thoughts.
>
> On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi Boyang,
> >
> > Thanks for the iterations here. I think this is something we should have
> > done a long time ago. It sounds like there are two API changes here:
> >
> > 1. We are introducing the `CommitFailedException` to wrap abortable
> > errors that are raised from `commitTransaction`. This sounds fine to me.
> As
> > far as I know, the only case we might need this is when we add support to
> > let producers recover from coordinator timeouts. Are there any others?
> >
> > I think the purpose here is to ensure non-fatal exceptions are unified
> under the same
> exception umbrella, to make the proceeding to abort any ongoing transaction
> much easier.
> I don't think `coordinator timeouts` is the only case to recover here,
> since we have other
> non-fatal exceptions such as UnknownPid.
>
> 2. We are wrapping non-fatal errors raised from `send` as `KafkaException`.
> > The motivation for this is less clear to me and it doesn't look like the
> > example from the KIP depends on it. My concern here is compatibility.
> > Currently we have the following documentation for the `Callback` API:
> >
> > ```
> >      *                  Non-Retriable exceptions (fatal, the message will
> > never be sent):
> >      *
> >      *                  InvalidTopicException
> >      *                  OffsetMetadataTooLargeException
> >      *                  RecordBatchTooLargeException
> >      *                  RecordTooLargeException
> >      *                  UnknownServerException
> >      *                  UnknownProducerIdException
> >      *                  InvalidProducerEpochException
> >      *
> >      *                  Retriable exceptions (transient, may be covered
> by
> > increasing #.retries):
> >      *
> >      *                  CorruptRecordException
> >      *                  InvalidMetadataException
> >      *                  NotEnoughReplicasAfterAppendException
> >      *                  NotEnoughReplicasException
> >      *                  OffsetOutOfRangeException
> >      *                  TimeoutException
> >      *                  UnknownTopicOrPartitionException
> > ```
> >
> > If we wrap all the retriable exceptions documented here as
> > `KafkaException`, wouldn't that break any error handling that users might
> > already have? it's gonna introduce a compatibility issue.
> >
> > The original intention was to simplify `send` callback error handling by
> doing exception wrapping, as on Streams level
> we have to prepare an exhausting list of exceptions to catch as fatal, and
> the same lengthy list to catch as
> non-fatal. It would be much easier if we got `hints` from the callback.
> However,
> I agree there is a compatibility concern, what about deprecating the
> existing:
>
> void onCompletion(RecordMetadata metadata, Exception exception)
>
> and replace it with:
>
> default void onCompletion(RecordMetadata metadata, Exception exception,
> boolean isFatal) {
>   this.onCompletion(metadata, exception);
> }
>
> to make sure new users get the benefit of understanding the fatality based
> on the info presented by the producer?
>
> Thanks,
> > Jason
> >
> >
> > On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com> wrote:
> >
> > > Why  we are receiving all emails kindly remove us from
> > > dev@kafka.apache.org we don't want to receive emails anymore.
> > >
> > > Thanks
> > > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com> wrote:
> > > >
> > > >
> > > > Thanks Boyang, yes I think I was confused about the different
> handling
> > of
> > > > two abortTxn calls, and now I get it was not intentional. I think I
> do
> > > not
> > > > have more concerns.
> > > >
> > > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for the clarification Guozhang, I got your point that we
> want
> > to
> > > > > have a consistent handling of fatal exceptions being thrown from
> the
> > > > > abortTxn. I modified the current template to move the fatal
> exception
> > > > > try-catch outside of the processing loop to make sure we could get
> a
> > > chance
> > > > > to close consumer/producer modules. Let me know what you think.
> > > > >
> > > > > Best,
> > > > > Boyang
> > > > >
> > > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > My understanding is that abortTransaction would only throw when
> the
> > > > > > producer is in fatal state. For CommitFailed, the producer should
> > > still
> > > > > be
> > > > > > in the abortable error state, so that abortTransaction call would
> > not
> > > > > throw.
> > > > > >
> > > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <
> wangguoz@gmail.com
> > >
> > > > > wrote:
> > > > > >
> > > > > >> Or are you going to maintain some internal state such that, the
> > > > > >> `abortTransaction` in the catch block would never throw again?
> > > > > >>
> > > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> > wangguoz@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Boyang/Jason,
> > > > > >> >
> > > > > >> > I've also thought about this (i.e. using CommitFailed for all
> > > > > >> non-fatal),
> > > > > >> > but what I'm pondering is that, in the catch (CommitFailed)
> > block,
> > > > > what
> > > > > >> > would happen if the `producer.abortTransaction();` throws
> again?
> > > > > should
> > > > > >> > that be captured as a fatal and cause the client to close
> again.
> > > > > >> >
> > > > > >> > If yes, then naively the pattern would be:
> > > > > >> >
> > > > > >> > ...
> > > > > >> > catch (CommitFailedException e) {
> > > > > >> >         // Transaction commit failed with abortable error,
> user
> > > could
> > > > > >> reset
> > > > > >> >         // the application state and resume with a new
> > > transaction.
> > > > > The
> > > > > >> > root
> > > > > >> >         // cause was wrapped in the thrown exception.
> > > > > >> >         resetToLastCommittedPositions(consumer);
> > > > > >> >         try {
> > > > > >> >             producer.abortTransaction();
> > > > > >> >         } catch (KafkaException e) {
> > > > > >> >             producer.close();
> > > > > >> >             consumer.close();
> > > > > >> >             throw e;
> > > > > >> >         }
> > > > > >> >     } catch (KafkaException e) {
> > > > > >> >         producer.close();
> > > > > >> >         consumer.close();
> > > > > >> >         throw e;
> > > > > >> >     }
> > > > > >> > ...
> > > > > >> >
> > > > > >> > Guozhang
> > > > > >> >
> > > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > > >> reluctanthero104@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> Hey Guozhang,
> > > > > >> >>
> > > > > >> >> Jason and I were discussing the new API offline and decided
> to
> > > take
> > > > > >> >> another
> > > > > >> >> approach. Firstly, the reason not to invent a new API with
> > > returned
> > > > > >> >> boolean
> > > > > >> >> flag is for compatibility consideration, since old EOS code
> > > would not
> > > > > >> know
> > > > > >> >> that a given transaction commit was failed internally as they
> > > don't
> > > > > >> listen
> > > > > >> >> to the output flag. Our proposed alternative solution is to
> let
> > > > > >> >> *commitTransaction
> > > > > >> >> throw CommitFailedException whenever the commit failed with
> > > non-fatal
> > > > > >> >> exception*, so that on the caller side the handling logic
> > > becomes:
> > > > > >> >>
> > > > > >> >> try {
> > > > > >> >>         if (shouldCommit) {
> > > > > >> >>             producer.commitTransaction();
> > > > > >> >>         } else {
> > > > > >> >>             resetToLastCommittedPositions(consumer);
> > > > > >> >>             producer.abortTransaction();
> > > > > >> >>         }
> > > > > >> >>     } catch (CommitFailedException e) {
> > > > > >> >>         // Transaction commit failed with abortable error,
> user
> > > could
> > > > > >> >> reset
> > > > > >> >>         // the application state and resume with a new
> > > transaction.
> > > > > The
> > > > > >> >> root
> > > > > >> >>         // cause was wrapped in the thrown exception.
> > > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > > >> >>         producer.abortTransaction();
> > > > > >> >>     } catch (KafkaException e) {
> > > > > >> >>         producer.close();
> > > > > >> >>         consumer.close();
> > > > > >> >>         throw e;
> > > > > >> >>     }
> > > > > >> >>
> > > > > >> >> This approach looks cleaner as all exception types other than
> > > > > >> CommitFailed
> > > > > >> >> will doom to be fatal, which is very easy to adopt for users.
> > In
> > > the
> > > > > >> >> meantime, we still maintain the commitTxn behavior to throw
> > > instead
> > > > > of
> > > > > >> >> silently failing.
> > > > > >> >>
> > > > > >> >> In addition, we decided to drop the recommendation to handle
> > > > > >> >> TimeoutException and leave it to the users to make the call.
> > The
> > > > > >> downside
> > > > > >> >> for blindly calling abortTxn upon timeout is that we could
> > > result in
> > > > > an
> > > > > >> >> illegal state when the commit was already successful on the
> > > broker
> > > > > >> >> side. Without a good guarantee on the outcome,
> overcomplicating
> > > the
> > > > > >> >> template should not be encouraged IMHO.
> > > > > >> >>
> > > > > >> >> Let me know your thoughts on the new approach here, thank
> you!
> > > > > >> >>
> > > > > >> >> Best,
> > > > > >> >> Boyang
> > > > > >> >>
> > > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > >> >> wrote:
> > > > > >> >>
> > > > > >> >> > Thanks for your clarification on 2)/3), that makes sense.
> > > > > >> >> >
> > > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > > >> >> reluctanthero104@gmail.com>
> > > > > >> >> > wrote:
> > > > > >> >> >
> > > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > > >> >> > >
> > > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > > wangguoz@gmail.com>
> > > > > >> >> > wrote:
> > > > > >> >> > >
> > > > > >> >> > > > Hello Boyang,
> > > > > >> >> > > >
> > > > > >> >> > > > Thanks for the updated KIP. I read it again and have
> the
> > > > > >> following
> > > > > >> >> > > > thoughts:
> > > > > >> >> > > >
> > > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does not throw
> > any
> > > > > >> >> non-fatal
> > > > > >> >> > > > exception, and instead we rely on the subsequent
> beginTxn
> > > call
> > > > > to
> > > > > >> >> > throw,
> > > > > >> >> > > it
> > > > > >> >> > > > may violate some callers with a pattern that relying on
> > > > > >> commitTxn to
> > > > > >> >> > > > succeed to make some non-rollback operations. For
> > example:
> > > > > >> >> > > >
> > > > > >> >> > > > beginTxn()
> > > > > >> >> > > > // do some read-write on my local DB
> > > > > >> >> > > > commitTxn()
> > > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > > >> >> > > >
> > > > > >> >> > > > -------------
> > > > > >> >> > > >
> > > > > >> >> > > > The issue is that, committing DB is a non-rollback
> > > operation,
> > > > > and
> > > > > >> >> users
> > > > > >> >> > > may
> > > > > >> >> > > > just rely on commitTxn to return without error to make
> > this
> > > > > >> >> > non-rollback
> > > > > >> >> > > > call. Of course we can just claim this pattern is not
> > > > > legitimate
> > > > > >> >> and is
> > > > > >> >> > > not
> > > > > >> >> > > > the right way of doing things, but many users may
> > naturally
> > > > > adopt
> > > > > >> >> this
> > > > > >> >> > > > pattern.
> > > > > >> >> > > >
> > > > > >> >> > > > So maybe we should still let commitTxn also throw
> > non-fatal
> > > > > >> >> exceptions,
> > > > > >> >> > > in
> > > > > >> >> > > > which case we would then call abortTxn again.
> > > > > >> >> > > >
> > > > > >> >> > > > But if we do this, the pattern becomes:
> > > > > >> >> > > >
> > > > > >> >> > > > try {
> > > > > >> >> > > >    beginTxn()
> > > > > >> >> > > >    // do something
> > > > > >> >> > > > } catch (Exception) {
> > > > > >> >> > > >    shouldCommit = false;
> > > > > >> >> > > > }
> > > > > >> >> > > >
> > > > > >> >> > > > if (shouldCommit) {
> > > > > >> >> > > >     try {
> > > > > >> >> > > >         commitTxn()
> > > > > >> >> > > >     } catch (...) {        // enumerate all fatal
> > > exceptions
> > > > > >> >> > > >         shutdown()
> > > > > >> >> > > >     } catch (KafkaException) {
> > > > > >> >> > > >         // non-fatal
> > > > > >> >> > > >         shouldCommit = false;
> > > > > >> >> > > >     }
> > > > > >> >> > > > }
> > > > > >> >> > > >
> > > > > >> >> > > > if (!shouldCommit && running()) {
> > > > > >> >> > > > try {
> > > > > >> >> > > >         abortTxn()
> > > > > >> >> > > >     } catch (KafkaException) {
> > > > > >> >> > > >         // only throw fatal
> > > > > >> >> > > >         shutdown()
> > > > > >> >> > > >     }
> > > > > >> >> > > > }
> > > > > >> >> > > >
> > > > > >> >> > > > ---------------------
> > > > > >> >> > > >
> > > > > >> >> > > > Which is much more complicated.
> > > > > >> >> > > >
> > > > > >> >> > > > Thank makes me think, the alternative we have discussed
> > > offline
> > > > > >> may
> > > > > >> >> be
> > > > > >> >> > > > better: let commitTxn() to return a boolean flag.
> > > > > >> >> > > >
> > > > > >> >> > > > * If it returns true, it means the commit succeeded.
> > Users
> > > can
> > > > > >> >> > > comfortably
> > > > > >> >> > > > continue and do any external non-rollback operations if
> > > they
> > > > > >> like.
> > > > > >> >> > > > * If it returns false, it means the commit failed with
> > > > > non-fatal
> > > > > >> >> error
> > > > > >> >> > > *AND
> > > > > >> >> > > > the txn has been aborted*. Users do not need to call
> > > abortTxn
> > > > > >> again.
> > > > > >> >> > > > * If it throws, then it means fatal errors. Users
> should
> > > shut
> > > > > >> down
> > > > > >> >> the
> > > > > >> >> > > > client.
> > > > > >> >> > > >
> > > > > >> >> > > > In this case, the pattern becomes:
> > > > > >> >> > > >
> > > > > >> >> > > > try {
> > > > > >> >> > > >    beginTxn()
> > > > > >> >> > > >    // do something
> > > > > >> >> > > > } catch (Exception) {
> > > > > >> >> > > >    shouldCommit = false;
> > > > > >> >> > > > }
> > > > > >> >> > > >
> > > > > >> >> > > > try {
> > > > > >> >> > > >     if (shouldCommit) {
> > > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > > >> >> > > >     } else {
> > > > > >> >> > > >         // reset offsets, rollback operations, etc
> > > > > >> >> > > >         abortTxn()
> > > > > >> >> > > >     }
> > > > > >> >> > > > } catch (KafkaException) {
> > > > > >> >> > > >     // only throw fatal
> > > > > >> >> > > >     shutdown()
> > > > > >> >> > > > }
> > > > > >> >> > > >
> > > > > >> >> > > > if (commitSucceeded)
> > > > > >> >> > > >    // do other non-rollbackable things
> > > > > >> >> > > > else
> > > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > > >> >> > > >
> > > > > >> >> > > > -------------------------
> > > > > >> >> > > >
> > > > > >> >> > > > Of course, if we want to have better visibility into
> what
> > > > > caused
> > > > > >> the
> > > > > >> >> > > commit
> > > > > >> >> > > > to fail and txn to abort. We can let the return type be
> > an
> > > enum
> > > > > >> >> instead
> > > > > >> >> > > of
> > > > > >> >> > > > a flag. But my main idea is to still let the commitTxn
> be
> > > the
> > > > > >> final
> > > > > >> >> > point
> > > > > >> >> > > > users can tell whether this txn succeeded or not,
> instead
> > > of
> > > > > >> >> relying on
> > > > > >> >> > > the
> > > > > >> >> > > > next beginTxn() call.
> > > > > >> >> > > >
> > > > > >> >> > > > I agree that adding a boolean flag is indeed useful in
> > this
> > > > > case.
> > > > > >> >> Will
> > > > > >> >> > > update the KIP.
> > > > > >> >> > >
> > > > > >> >> > > 1. Re: "while maintaining the behavior to throw fatal
> > > exception
> > > > > in
> > > > > >> >> raw"
> > > > > >> >> > not
> > > > > >> >> > > > sure what do you mean by "throw" here. Are you
> proposing
> > > the
> > > > > >> >> callback
> > > > > >> >> > > would
> > > > > >> >> > > > *pass* (not throw) in any fatal exceptions when
> triggered
> > > > > without
> > > > > >> >> > > wrapping?
> > > > > >> >> > > >
> > > > > >> >> > > > Yes, I want to say *pass*, the benefit is to make the
> end
> > > > > user's
> > > > > >> >> > > expectation consistent
> > > > > >> >> > > regarding exception handling.
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > > > 2. I'm not sure I understand the claim regarding the
> > > callback
> > > > > >> that
> > > > > >> >> "In
> > > > > >> >> > > EOS
> > > > > >> >> > > > setup, it is not required to handle the exception". Are
> > you
> > > > > >> >> proposing
> > > > > >> >> > > that,
> > > > > >> >> > > > e.g. in Streams, we do not try to handle any exceptions
> > if
> > > EOS
> > > > > is
> > > > > >> >> > enabled
> > > > > >> >> > > > in the callback anymore, but just let commitTxn()
> itself
> > to
> > > > > fail
> > > > > >> to
> > > > > >> >> be
> > > > > >> >> > > > notified about the problem? Personally I think in
> Streams
> > > we
> > > > > >> should
> > > > > >> >> > just
> > > > > >> >> > > > make the handling logic of the callback to be
> consistent
> > > > > >> regardless
> > > > > >> >> of
> > > > > >> >> > > the
> > > > > >> >> > > > EOS settings (today we have different logic depending
> on
> > > this
> > > > > >> logic,
> > > > > >> >> > > which
> > > > > >> >> > > > I think could be unified as well).
> > > > > >> >> > > >
> > > > > >> >> > > > My idea originates from the claim on send API:
> > > > > >> >> > > "When used as part of a transaction, it is not necessary
> to
> > > > > define
> > > > > >> a
> > > > > >> >> > > callback or check the result of the future  in order to
> > > detect
> > > > > >> errors
> > > > > >> >> > from
> > > > > >> >> > > <code>send</code>. "
> > > > > >> >> > > My understanding is that for EOS, the exception will be
> > > detected
> > > > > by
> > > > > >> >> > calling
> > > > > >> >> > > transactional APIs either way, so it is a duplicate
> > handling
> > > to
> > > > > >> track
> > > > > >> >> > > all the sendExceptions in RecordCollector. However, I
> > looked
> > > up
> > > > > >> >> > > sendException is being used today as follow:
> > > > > >> >> > >
> > > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any default
> or
> > > > > >> customized
> > > > > >> >> > > exception handler to handle
> > > > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > > > >> >> > > 3. Check and rethrow exceptions in close(), flush() or
> new
> > > send()
> > > > > >> >> calls
> > > > > >> >> > >
> > > > > >> >> > > To my understanding, we should still honor the commitment
> > to
> > > #1
> > > > > for
> > > > > >> >> any
> > > > > >> >> > > user customized implementation. The #2 does not really
> > > affect our
> > > > > >> >> > decision
> > > > > >> >> > > upon EOS. The #3 here is still valuable as it could help
> us
> > > fail
> > > > > >> fast
> > > > > >> >> in
> > > > > >> >> > > new send() instead of waiting to later stage of
> processing.
> > > In
> > > > > that
> > > > > >> >> > sense,
> > > > > >> >> > > I agree we should continue to record send exceptions even
> > > under
> > > > > EOS
> > > > > >> >> case
> > > > > >> >> > to
> > > > > >> >> > > ensure the strength of stream side Producer logic. On the
> > > safer
> > > > > >> side,
> > > > > >> >> we
> > > > > >> >> > no
> > > > > >> >> > > longer need to wrap certain fatal exceptions like
> > > ProducerFenced
> > > > > as
> > > > > >> >> > > TaskMigrated, since they should not crash the stream
> thread
> > > if
> > > > > >> thrown
> > > > > >> >> in
> > > > > >> >> > > raw format, once we adopt the new processing model in the
> > > send
> > > > > >> phase.
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > > >
> > > > > >> >> > > > Guozhang
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > > >> >> > reluctanthero104@gmail.com>
> > > > > >> >> > > > wrote:
> > > > > >> >> > > >
> > > > > >> >> > > > > Thanks for everyone's feedback so far. I have
> polished
> > > the
> > > > > KIP
> > > > > >> >> after
> > > > > >> >> > > > > offline discussion with some folks working on EOS to
> > > make the
> > > > > >> >> > exception
> > > > > >> >> > > > > handling more lightweight. The essential change is
> that
> > > we
> > > > > are
> > > > > >> not
> > > > > >> >> > > > > inventing a new intermediate exception type, but
> > instead
> > > > > >> >> separating a
> > > > > >> >> > > > full
> > > > > >> >> > > > > transaction into two phases:
> > > > > >> >> > > > >
> > > > > >> >> > > > > 1. The data transmission phase
> > > > > >> >> > > > > 2. The commit phase
> > > > > >> >> > > > >
> > > > > >> >> > > > > This way for any exception thrown from phase 1, will
> be
> > > an
> > > > > >> >> indicator
> > > > > >> >> > in
> > > > > >> >> > > > > phase 2 whether we should do commit or abort, and
> from
> > > now on
> > > > > >> >> > > > > `commitTransaction` should only throw fatal
> exceptions,
> > > > > >> similar to
> > > > > >> >> > > > > `abortTransaction`, so that any KafkaException caught
> > in
> > > the
> > > > > >> >> commit
> > > > > >> >> > > phase
> > > > > >> >> > > > > will be definitely fatal to crash the app. For more
> > > advanced
> > > > > >> users
> > > > > >> >> > such
> > > > > >> >> > > > as
> > > > > >> >> > > > > Streams, we have the ability to further wrap selected
> > > types
> > > > > of
> > > > > >> >> fatal
> > > > > >> >> > > > > exceptions to trigger task migration if necessary.
> > > > > >> >> > > > >
> > > > > >> >> > > > > More details in the KIP, feel free to take another
> > look,
> > > > > >> thanks!
> > > > > >> >> > > > >
> > > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > > > >> >> > > reluctanthero104@gmail.com>
> > > > > >> >> > > > > wrote:
> > > > > >> >> > > > >
> > > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> > > > > >> >> bruno@confluent.io>
> > > > > >> >> > > > wrote:
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> Like Matthias, I do also not know the producer
> > > internal
> > > > > well
> > > > > >> >> > enough
> > > > > >> >> > > to
> > > > > >> >> > > > > >> comment on the categorization. However, I think
> > > having a
> > > > > >> super
> > > > > >> >> > > > exception
> > > > > >> >> > > > > >> (e.g. RetriableException) that  encodes if an
> > > exception is
> > > > > >> >> fatal
> > > > > >> >> > or
> > > > > >> >> > > > not
> > > > > >> >> > > > > >> is cleaner, better understandable and less
> > > error-prone,
> > > > > >> because
> > > > > >> >> > > > ideally
> > > > > >> >> > > > > >> when you add a new non-fatal exception in future
> you
> > > just
> > > > > >> need
> > > > > >> >> to
> > > > > >> >> > > > think
> > > > > >> >> > > > > >> about letting it inherit from the super exception
> > and
> > > all
> > > > > >> the
> > > > > >> >> rest
> > > > > >> >> > > of
> > > > > >> >> > > > > >> the code will just behave correctly without the
> need
> > > to
> > > > > wrap
> > > > > >> >> the
> > > > > >> >> > new
> > > > > >> >> > > > > >> exception into another exception each time it is
> > > thrown
> > > > > >> (maybe
> > > > > >> >> it
> > > > > >> >> > is
> > > > > >> >> > > > > >> thrown at different location in the code).
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> As far as I understand the following statement
> from
> > > your
> > > > > >> >> previous
> > > > > >> >> > > > e-mail
> > > > > >> >> > > > > >> is the reason that currently such a super
> exception
> > > is not
> > > > > >> >> > possible:
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> "Right now we have RetriableException type, if we
> > are
> > > > > going
> > > > > >> to
> > > > > >> >> > add a
> > > > > >> >> > > > > >> `ProducerRetriableException` type, we have to put
> > > this new
> > > > > >> >> > interface
> > > > > >> >> > > > as
> > > > > >> >> > > > > >> the parent of the RetriableException, because not
> > all
> > > > > thrown
> > > > > >> >> > > non-fatal
> > > > > >> >> > > > > >> exceptions are `retriable` in general for
> producer"
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> In the list of exceptions in your KIP, I found
> > > non-fatal
> > > > > >> >> > exceptions
> > > > > >> >> > > > that
> > > > > >> >> > > > > >> do not inherit from RetriableException. I guess
> > those
> > > are
> > > > > >> the
> > > > > >> >> ones
> > > > > >> >> > > you
> > > > > >> >> > > > > >> are referring to in your statement:
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> InvalidProducerEpochException
> > > > > >> >> > > > > >> InvalidPidMappingException
> > > > > >> >> > > > > >> TransactionAbortedException
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> All of those exceptions are non-fatal and do not
> > > inherit
> > > > > >> from
> > > > > >> >> > > > > >> RetriableException. Is there a reason for that? If
> > > they
> > > > > >> >> depended
> > > > > >> >> > > from
> > > > > >> >> > > > > >> RetriableException we would be a bit closer to a
> > super
> > > > > >> >> exception I
> > > > > >> >> > > > > >> mention above.
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> The reason is that sender may catch those
> exceptions
> > > in
> > > > > the
> > > > > >> >> > > > > > ProduceResponse, and it currently does infinite
> > > > > >> >> > > > > > retries on RetriableException. To make sure we
> could
> > > > > trigger
> > > > > >> the
> > > > > >> >> > > > > > abortTransaction() by catching non-fatal thrown
> > > > > >> >> > > > > > exceptions and reinitialize the txn state, we chose
> > > not to
> > > > > >> let
> > > > > >> >> > those
> > > > > >> >> > > > > > exceptions inherit RetriableException so that
> > > > > >> >> > > > > > they won't cause infinite retry on sender.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > > >> >> UnknownProducerIdException, I
> > > > > >> >> > > > think
> > > > > >> >> > > > > >> to understand that their fatality depends on the
> > type
> > > > > (i.e.
> > > > > >> >> > > > > >> configuration) of the producer. That makes it
> > > difficult to
> > > > > >> >> have a
> > > > > >> >> > > > super
> > > > > >> >> > > > > >> exception that encodes the retriability as
> mentioned
> > > > > above.
> > > > > >> >> Would
> > > > > >> >> > it
> > > > > >> >> > > > be
> > > > > >> >> > > > > >> possible to introduce exceptions that inherit from
> > > > > >> >> > > RetriableException
> > > > > >> >> > > > > >> and that are thrown when those exceptions are
> caught
> > > from
> > > > > >> the
> > > > > >> >> > > brokers
> > > > > >> >> > > > > >> and the type of the producer is such that the
> > > exceptions
> > > > > are
> > > > > >> >> > > > retriable?
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> Yea, I think in general the exception type mixing
> is
> > > > > >> difficult
> > > > > >> >> to
> > > > > >> >> > > get
> > > > > >> >> > > > > > them all right. I have already proposed another
> > > solution
> > > > > >> based
> > > > > >> >> on
> > > > > >> >> > my
> > > > > >> >> > > > > > offline discussion with some folks working on EOS
> > > > > >> >> > > > > > to make the handling more straightforward for end
> > users
> > > > > >> without
> > > > > >> >> the
> > > > > >> >> > > > need
> > > > > >> >> > > > > > to distinguish exception fatality.
> > > > > >> >> > > > > >
> > > > > >> >> > > > > >> Best,
> > > > > >> >> > > > > >> Bruno
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a pass
> over
> > > the
> > > > > >> list
> > > > > >> >> and
> > > > > >> >> > > here
> > > > > >> >> > > > > are
> > > > > >> >> > > > > >> > some thoughts:
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > 0) Although this is not part of the public API,
> I
> > > think
> > > > > we
> > > > > >> >> > should
> > > > > >> >> > > > make
> > > > > >> >> > > > > >> sure
> > > > > >> >> > > > > >> > that the suggested pattern (i.e. user can always
> > > call
> > > > > >> >> abortTxn()
> > > > > >> >> > > > when
> > > > > >> >> > > > > >> > handling non-fatal errors) are indeed supported.
> > > E.g. if
> > > > > >> the
> > > > > >> >> txn
> > > > > >> >> > > is
> > > > > >> >> > > > > >> already
> > > > > >> >> > > > > >> > aborted by the broker side, then users can still
> > > call
> > > > > >> >> abortTxn
> > > > > >> >> > > which
> > > > > >> >> > > > > >> would
> > > > > >> >> > > > > >> > not throw another exception but just be treated
> > as a
> > > > > >> no-op.
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think
> this
> > > error
> > > > > >> can
> > > > > >> >> > also
> > > > > >> >> > > be
> > > > > >> >> > > > > >> > returned but not documented yet. This should be
> a
> > > > > >> non-fatal
> > > > > >> >> > error.
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is
> > > returned
> > > > > from
> > > > > >> >> > broker
> > > > > >> >> > > > when
> > > > > >> >> > > > > >> txn
> > > > > >> >> > > > > >> > state transition failed (e.g. it is trying to
> > > transit to
> > > > > >> >> > > > > complete-commit
> > > > > >> >> > > > > >> > while the current state is not prepare-commit).
> > This
> > > > > error
> > > > > >> >> could
> > > > > >> >> > > > > >> indicates
> > > > > >> >> > > > > >> > a bug on the client internal code or the broker
> > > code,
> > > > > OR a
> > > > > >> >> user
> > > > > >> >> > > > error
> > > > > >> >> > > > > >> --- a
> > > > > >> >> > > > > >> > similar error is
> ConcurrentTransactionsException,
> > > i.e.
> > > > > if
> > > > > >> >> Kafka
> > > > > >> >> > is
> > > > > >> >> > > > > >> bug-free
> > > > > >> >> > > > > >> > these exceptions would only be returned if users
> > > try to
> > > > > do
> > > > > >> >> > > something
> > > > > >> >> > > > > >> wrong,
> > > > > >> >> > > > > >> > e.g. calling abortTxn right after a commitTxn,
> > etc.
> > > So
> > > > > I'm
> > > > > >> >> > > thinking
> > > > > >> >> > > > it
> > > > > >> >> > > > > >> > should be a non-fatal error instead of a fatal
> > > error,
> > > > > >> wdyt?
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> > > > > transactional
> > > > > >> >> > > sequence
> > > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> > > > > >> >> > > > *OutOfSequenceException*
> > > > > >> >> > > > > >> that
> > > > > >> >> > > > > >> > is treated as non-fatal. I guess your proposal
> is
> > > that
> > > > > >> >> > > > > >> > OutOfOrderSequenceException would be treated
> > either
> > > as
> > > > > >> fatal
> > > > > >> >> > with
> > > > > >> >> > > > > >> > transactional producer, or non-fatal with
> > idempotent
> > > > > >> >> producer,
> > > > > >> >> > is
> > > > > >> >> > > > that
> > > > > >> >> > > > > >> > right? If the producer is only configured with
> > > > > idempotency
> > > > > >> >> but
> > > > > >> >> > not
> > > > > >> >> > > > > >> > transaction, then throwing a
> > > > > >> >> TransactionStateCorruptedException
> > > > > >> >> > > for
> > > > > >> >> > > > > >> > non-fatal errors would be confusing since users
> > are
> > > not
> > > > > >> using
> > > > > >> >> > > > > >> transactions
> > > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > > >> OutOfSequenceException
> > > > > >> >> > as-is
> > > > > >> >> > > > > (i.e.
> > > > > >> >> > > > > >> > not wrapped) no matter how the producer is
> > > configured,
> > > > > and
> > > > > >> >> let
> > > > > >> >> > the
> > > > > >> >> > > > > >> caller
> > > > > >> >> > > > > >> > decide how to handle it based on whether it is
> > only
> > > > > >> >> idempotent
> > > > > >> >> > or
> > > > > >> >> > > > > >> > transactional itself.
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()`
> > callback /
> > > > > >> future
> > > > > >> >> can
> > > > > >> >> > > also
> > > > > >> >> > > > > >> throw
> > > > > >> >> > > > > >> > txn-related exceptions, I think this KIP should
> > also
> > > > > cover
> > > > > >> >> this
> > > > > >> >> > > API
> > > > > >> >> > > > as
> > > > > >> >> > > > > >> well?
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > 5) This is related to 1/2) above: sometimes
> those
> > > > > >> non-fatal
> > > > > >> >> > errors
> > > > > >> >> > > > > like
> > > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to
> > the
> > > > > state
> > > > > >> >> being
> > > > > >> >> > > > > >> corrupted
> > > > > >> >> > > > > >> > at the broker side, but maybe users are doing
> > > something
> > > > > >> >> wrong.
> > > > > >> >> > So
> > > > > >> >> > > > I'm
> > > > > >> >> > > > > >> > wondering if we should further distinguish those
> > > > > non-fatal
> > > > > >> >> > errors
> > > > > >> >> > > > > >> between
> > > > > >> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a
> > > broker
> > > > > >> timed
> > > > > >> >> > out
> > > > > >> >> > > > and
> > > > > >> >> > > > > >> > aborted a txn and later an endTxn request is
> > > received,
> > > > > >> and b)
> > > > > >> >> > the
> > > > > >> >> > > > > user's
> > > > > >> >> > > > > >> > API call pattern is incorrect, causing the
> request
> > > to be
> > > > > >> >> > rejected
> > > > > >> >> > > > with
> > > > > >> >> > > > > >> an
> > > > > >> >> > > > > >> > error code from the broker.
> > > > > >> >> *TransactionStateCorruptedException*
> > > > > >> >> > > > feels
> > > > > >> >> > > > > >> to
> > > > > >> >> > > > > >> > me more like for case a), but not case b).
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > Guozhang
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > > >> >> > > > > >> > wrote:
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> >> Thanks Matthias, I think your proposal makes
> > sense
> > > as
> > > > > >> well,
> > > > > >> >> on
> > > > > >> >> > > the
> > > > > >> >> > > > > pro
> > > > > >> >> > > > > >> side
> > > > > >> >> > > > > >> >> we could have a universally agreed exception
> type
> > > to be
> > > > > >> >> caught
> > > > > >> >> > by
> > > > > >> >> > > > the
> > > > > >> >> > > > > >> user,
> > > > > >> >> > > > > >> >> without having an extra layer on top of the
> > actual
> > > > > >> >> exceptions.
> > > > > >> >> > I
> > > > > >> >> > > > > could
> > > > > >> >> > > > > >> see
> > > > > >> >> > > > > >> >> some issue on downsides:
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >> 1. The exception hierarchy will be more
> complex.
> > > Right
> > > > > >> now
> > > > > >> >> we
> > > > > >> >> > > have
> > > > > >> >> > > > > >> >> RetriableException type, if we are going to
> add a
> > > > > >> >> > > > > >> >> `ProducerRetriableException` type, we have to
> put
> > > this
> > > > > >> new
> > > > > >> >> > > > interface
> > > > > >> >> > > > > >> as the
> > > > > >> >> > > > > >> >> parent of the RetriableException, because not
> all
> > > > > thrown
> > > > > >> >> > > non-fatal
> > > > > >> >> > > > > >> >> exceptions are `retriable` in general for
> > > producer, for
> > > > > >> >> example
> > > > > >> >> > > > > >> >> <
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >>
> > > > > >> >> > > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > >> >> > > > > >> >>> .
> > > > > >> >> > > > > >> >> We could have an empty interface
> > > > > >> >> `ProducerRetriableException`
> > > > > >> >> > to
> > > > > >> >> > > > let
> > > > > >> >> > > > > >> all
> > > > > >> >> > > > > >> >> the thrown exceptions implement for sure, but
> > it's
> > > a
> > > > > bit
> > > > > >> >> > > unorthodox
> > > > > >> >> > > > > in
> > > > > >> >> > > > > >> the
> > > > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >> 2. There are cases where we throw a
> > KafkaException
> > > > > >> wrapping
> > > > > >> >> > > another
> > > > > >> >> > > > > >> >> KafkaException as either fatal or non-fatal. If
> > we
> > > use
> > > > > an
> > > > > >> >> > > interface
> > > > > >> >> > > > > to
> > > > > >> >> > > > > >> >> solve #1, it is also required to implement
> > another
> > > > > >> bloated
> > > > > >> >> > > > exception
> > > > > >> >> > > > > >> class
> > > > > >> >> > > > > >> >> which could replace KafkaException type, as we
> > > couldn't
> > > > > >> mark
> > > > > >> >> > > > > >> KafkaException
> > > > > >> >> > > > > >> >> as retriable for sure.
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping
> means
> > we
> > > > > could
> > > > > >> >> limit
> > > > > >> >> > > the
> > > > > >> >> > > > > >> scope
> > > > > >> >> > > > > >> >> of affection to the producer only, which is
> > > important
> > > > > >> since
> > > > > >> >> we
> > > > > >> >> > > > don't
> > > > > >> >> > > > > >> want
> > > > > >> >> > > > > >> >> shared exception types to implement a
> > > producer-related
> > > > > >> >> > interface,
> > > > > >> >> > > > > such
> > > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >> Best,
> > > > > >> >> > > > > >> >> Boyang
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax
> <
> > > > > >> >> > mjsax@apache.org
> > > > > >> >> > > >
> > > > > >> >> > > > > >> wrote:
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>> Overall, categorizing exceptions makes a lot
> of
> > > sense.
> > > > > >> As I
> > > > > >> >> > > don't
> > > > > >> >> > > > > know
> > > > > >> >> > > > > >> >>> the producer internals well enough, I cannot
> > > comment
> > > > > on
> > > > > >> the
> > > > > >> >> > > > > >> >>> categorization in detail though.
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>> What I am wondering is, if we should introduce
> > an
> > > > > >> exception
> > > > > >> >> > > > > interface
> > > > > >> >> > > > > >> >>> that non-fatal exception would implement
> instead
> > > of
> > > > > >> >> creating a
> > > > > >> >> > > new
> > > > > >> >> > > > > >> class
> > > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions? What
> would
> > > be the
> > > > > >> >> > pros/cons
> > > > > >> >> > > > for
> > > > > >> >> > > > > >> >>> both designs?
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>> -Matthias
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > > >> >> > > > > >> >>>> Hey there,
> > > > > >> >> > > > > >> >>>>
> > > > > >> >> > > > > >> >>>> I would like to start a discussion thread for
> > > > > KIP-691:
> > > > > >> >> > > > > >> >>>>
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >>
> > > > > >> >> > > > >
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >>
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > >> >> > > > > >> >>>>
> > > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the exception
> > > handling
> > > > > >> logic
> > > > > >> >> > for
> > > > > >> >> > > > > >> >>>> transactional Producer users by classifying
> > > fatal and
> > > > > >> >> > non-fatal
> > > > > >> >> > > > > >> >>> exceptions
> > > > > >> >> > > > > >> >>>> and throw them correspondingly for easier
> catch
> > > and
> > > > > >> retry.
> > > > > >> >> > Let
> > > > > >> >> > > me
> > > > > >> >> > > > > >> know
> > > > > >> >> > > > > >> >>> what
> > > > > >> >> > > > > >> >>>> you think.
> > > > > >> >> > > > > >> >>>>
> > > > > >> >> > > > > >> >>>> Best,
> > > > > >> >> > > > > >> >>>> Boyang
> > > > > >> >> > > > > >> >>>>
> > > > > >> >> > > > > >> >>>
> > > > > >> >> > > > > >> >>
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >> >
> > > > > >> >> > > > > >>
> > > > > >> >> > > > > >
> > > > > >> >> > > > >
> > > > > >> >> > > >
> > > > > >> >> > > >
> > > > > >> >> > > > --
> > > > > >> >> > > > -- Guozhang
> > > > > >> >> > > >
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >> >
> > > > > >> >> > --
> > > > > >> >> > -- Guozhang
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > -- Guozhang
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> -- Guozhang
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks Jason for the thoughts.

On Wed, Jan 27, 2021 at 11:52 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hi Boyang,
>
> Thanks for the iterations here. I think this is something we should have
> done a long time ago. It sounds like there are two API changes here:
>
> 1. We are introducing the `CommitFailedException` to wrap abortable
> errors that are raised from `commitTransaction`. This sounds fine to me. As
> far as I know, the only case we might need this is when we add support to
> let producers recover from coordinator timeouts. Are there any others?
>
> I think the purpose here is to ensure non-fatal exceptions are unified
under the same
exception umbrella, to make the proceeding to abort any ongoing transaction
much easier.
I don't think `coordinator timeouts` is the only case to recover here,
since we have other
non-fatal exceptions such as UnknownPid.

2. We are wrapping non-fatal errors raised from `send` as `KafkaException`.
> The motivation for this is less clear to me and it doesn't look like the
> example from the KIP depends on it. My concern here is compatibility.
> Currently we have the following documentation for the `Callback` API:
>
> ```
>      *                  Non-Retriable exceptions (fatal, the message will
> never be sent):
>      *
>      *                  InvalidTopicException
>      *                  OffsetMetadataTooLargeException
>      *                  RecordBatchTooLargeException
>      *                  RecordTooLargeException
>      *                  UnknownServerException
>      *                  UnknownProducerIdException
>      *                  InvalidProducerEpochException
>      *
>      *                  Retriable exceptions (transient, may be covered by
> increasing #.retries):
>      *
>      *                  CorruptRecordException
>      *                  InvalidMetadataException
>      *                  NotEnoughReplicasAfterAppendException
>      *                  NotEnoughReplicasException
>      *                  OffsetOutOfRangeException
>      *                  TimeoutException
>      *                  UnknownTopicOrPartitionException
> ```
>
> If we wrap all the retriable exceptions documented here as
> `KafkaException`, wouldn't that break any error handling that users might
> already have? it's gonna introduce a compatibility issue.
>
> The original intention was to simplify `send` callback error handling by
doing exception wrapping, as on Streams level
we have to prepare an exhausting list of exceptions to catch as fatal, and
the same lengthy list to catch as
non-fatal. It would be much easier if we got `hints` from the callback.
However,
I agree there is a compatibility concern, what about deprecating the
existing:

void onCompletion(RecordMetadata metadata, Exception exception)

and replace it with:

default void onCompletion(RecordMetadata metadata, Exception exception,
boolean isFatal) {
  this.onCompletion(metadata, exception);
}

to make sure new users get the benefit of understanding the fatality based
on the info presented by the producer?

Thanks,
> Jason
>
>
> On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com> wrote:
>
> > Why  we are receiving all emails kindly remove us from
> > dev@kafka.apache.org we don't want to receive emails anymore.
> >
> > Thanks
> > > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com> wrote:
> > >
> > >
> > > Thanks Boyang, yes I think I was confused about the different handling
> of
> > > two abortTxn calls, and now I get it was not intentional. I think I do
> > not
> > > have more concerns.
> > >
> > > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Thanks for the clarification Guozhang, I got your point that we want
> to
> > > > have a consistent handling of fatal exceptions being thrown from the
> > > > abortTxn. I modified the current template to move the fatal exception
> > > > try-catch outside of the processing loop to make sure we could get a
> > chance
> > > > to close consumer/producer modules. Let me know what you think.
> > > >
> > > > Best,
> > > > Boyang
> > > >
> > > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > My understanding is that abortTransaction would only throw when the
> > > > > producer is in fatal state. For CommitFailed, the producer should
> > still
> > > > be
> > > > > in the abortable error state, so that abortTransaction call would
> not
> > > > throw.
> > > > >
> > > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <wangguoz@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> Or are you going to maintain some internal state such that, the
> > > > >> `abortTransaction` in the catch block would never throw again?
> > > > >>
> > > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <
> wangguoz@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> > Hi Boyang/Jason,
> > > > >> >
> > > > >> > I've also thought about this (i.e. using CommitFailed for all
> > > > >> non-fatal),
> > > > >> > but what I'm pondering is that, in the catch (CommitFailed)
> block,
> > > > what
> > > > >> > would happen if the `producer.abortTransaction();` throws again?
> > > > should
> > > > >> > that be captured as a fatal and cause the client to close again.
> > > > >> >
> > > > >> > If yes, then naively the pattern would be:
> > > > >> >
> > > > >> > ...
> > > > >> > catch (CommitFailedException e) {
> > > > >> >         // Transaction commit failed with abortable error, user
> > could
> > > > >> reset
> > > > >> >         // the application state and resume with a new
> > transaction.
> > > > The
> > > > >> > root
> > > > >> >         // cause was wrapped in the thrown exception.
> > > > >> >         resetToLastCommittedPositions(consumer);
> > > > >> >         try {
> > > > >> >             producer.abortTransaction();
> > > > >> >         } catch (KafkaException e) {
> > > > >> >             producer.close();
> > > > >> >             consumer.close();
> > > > >> >             throw e;
> > > > >> >         }
> > > > >> >     } catch (KafkaException e) {
> > > > >> >         producer.close();
> > > > >> >         consumer.close();
> > > > >> >         throw e;
> > > > >> >     }
> > > > >> > ...
> > > > >> >
> > > > >> > Guozhang
> > > > >> >
> > > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > > >> reluctanthero104@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Hey Guozhang,
> > > > >> >>
> > > > >> >> Jason and I were discussing the new API offline and decided to
> > take
> > > > >> >> another
> > > > >> >> approach. Firstly, the reason not to invent a new API with
> > returned
> > > > >> >> boolean
> > > > >> >> flag is for compatibility consideration, since old EOS code
> > would not
> > > > >> know
> > > > >> >> that a given transaction commit was failed internally as they
> > don't
> > > > >> listen
> > > > >> >> to the output flag. Our proposed alternative solution is to let
> > > > >> >> *commitTransaction
> > > > >> >> throw CommitFailedException whenever the commit failed with
> > non-fatal
> > > > >> >> exception*, so that on the caller side the handling logic
> > becomes:
> > > > >> >>
> > > > >> >> try {
> > > > >> >>         if (shouldCommit) {
> > > > >> >>             producer.commitTransaction();
> > > > >> >>         } else {
> > > > >> >>             resetToLastCommittedPositions(consumer);
> > > > >> >>             producer.abortTransaction();
> > > > >> >>         }
> > > > >> >>     } catch (CommitFailedException e) {
> > > > >> >>         // Transaction commit failed with abortable error, user
> > could
> > > > >> >> reset
> > > > >> >>         // the application state and resume with a new
> > transaction.
> > > > The
> > > > >> >> root
> > > > >> >>         // cause was wrapped in the thrown exception.
> > > > >> >>         resetToLastCommittedPositions(consumer);
> > > > >> >>         producer.abortTransaction();
> > > > >> >>     } catch (KafkaException e) {
> > > > >> >>         producer.close();
> > > > >> >>         consumer.close();
> > > > >> >>         throw e;
> > > > >> >>     }
> > > > >> >>
> > > > >> >> This approach looks cleaner as all exception types other than
> > > > >> CommitFailed
> > > > >> >> will doom to be fatal, which is very easy to adopt for users.
> In
> > the
> > > > >> >> meantime, we still maintain the commitTxn behavior to throw
> > instead
> > > > of
> > > > >> >> silently failing.
> > > > >> >>
> > > > >> >> In addition, we decided to drop the recommendation to handle
> > > > >> >> TimeoutException and leave it to the users to make the call.
> The
> > > > >> downside
> > > > >> >> for blindly calling abortTxn upon timeout is that we could
> > result in
> > > > an
> > > > >> >> illegal state when the commit was already successful on the
> > broker
> > > > >> >> side. Without a good guarantee on the outcome, overcomplicating
> > the
> > > > >> >> template should not be encouraged IMHO.
> > > > >> >>
> > > > >> >> Let me know your thoughts on the new approach here, thank you!
> > > > >> >>
> > > > >> >> Best,
> > > > >> >> Boyang
> > > > >> >>
> > > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> > wangguoz@gmail.com>
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >> > Thanks for your clarification on 2)/3), that makes sense.
> > > > >> >> >
> > > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > > >> >> reluctanthero104@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> >
> > > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > > >> >> > >
> > > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > > wangguoz@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> > >
> > > > >> >> > > > Hello Boyang,
> > > > >> >> > > >
> > > > >> >> > > > Thanks for the updated KIP. I read it again and have the
> > > > >> following
> > > > >> >> > > > thoughts:
> > > > >> >> > > >
> > > > >> >> > > > 0. I'm a bit concerned that if commitTxn does not throw
> any
> > > > >> >> non-fatal
> > > > >> >> > > > exception, and instead we rely on the subsequent beginTxn
> > call
> > > > to
> > > > >> >> > throw,
> > > > >> >> > > it
> > > > >> >> > > > may violate some callers with a pattern that relying on
> > > > >> commitTxn to
> > > > >> >> > > > succeed to make some non-rollback operations. For
> example:
> > > > >> >> > > >
> > > > >> >> > > > beginTxn()
> > > > >> >> > > > // do some read-write on my local DB
> > > > >> >> > > > commitTxn()
> > > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > > >> >> > > >
> > > > >> >> > > > -------------
> > > > >> >> > > >
> > > > >> >> > > > The issue is that, committing DB is a non-rollback
> > operation,
> > > > and
> > > > >> >> users
> > > > >> >> > > may
> > > > >> >> > > > just rely on commitTxn to return without error to make
> this
> > > > >> >> > non-rollback
> > > > >> >> > > > call. Of course we can just claim this pattern is not
> > > > legitimate
> > > > >> >> and is
> > > > >> >> > > not
> > > > >> >> > > > the right way of doing things, but many users may
> naturally
> > > > adopt
> > > > >> >> this
> > > > >> >> > > > pattern.
> > > > >> >> > > >
> > > > >> >> > > > So maybe we should still let commitTxn also throw
> non-fatal
> > > > >> >> exceptions,
> > > > >> >> > > in
> > > > >> >> > > > which case we would then call abortTxn again.
> > > > >> >> > > >
> > > > >> >> > > > But if we do this, the pattern becomes:
> > > > >> >> > > >
> > > > >> >> > > > try {
> > > > >> >> > > >    beginTxn()
> > > > >> >> > > >    // do something
> > > > >> >> > > > } catch (Exception) {
> > > > >> >> > > >    shouldCommit = false;
> > > > >> >> > > > }
> > > > >> >> > > >
> > > > >> >> > > > if (shouldCommit) {
> > > > >> >> > > >     try {
> > > > >> >> > > >         commitTxn()
> > > > >> >> > > >     } catch (...) {        // enumerate all fatal
> > exceptions
> > > > >> >> > > >         shutdown()
> > > > >> >> > > >     } catch (KafkaException) {
> > > > >> >> > > >         // non-fatal
> > > > >> >> > > >         shouldCommit = false;
> > > > >> >> > > >     }
> > > > >> >> > > > }
> > > > >> >> > > >
> > > > >> >> > > > if (!shouldCommit && running()) {
> > > > >> >> > > > try {
> > > > >> >> > > >         abortTxn()
> > > > >> >> > > >     } catch (KafkaException) {
> > > > >> >> > > >         // only throw fatal
> > > > >> >> > > >         shutdown()
> > > > >> >> > > >     }
> > > > >> >> > > > }
> > > > >> >> > > >
> > > > >> >> > > > ---------------------
> > > > >> >> > > >
> > > > >> >> > > > Which is much more complicated.
> > > > >> >> > > >
> > > > >> >> > > > Thank makes me think, the alternative we have discussed
> > offline
> > > > >> may
> > > > >> >> be
> > > > >> >> > > > better: let commitTxn() to return a boolean flag.
> > > > >> >> > > >
> > > > >> >> > > > * If it returns true, it means the commit succeeded.
> Users
> > can
> > > > >> >> > > comfortably
> > > > >> >> > > > continue and do any external non-rollback operations if
> > they
> > > > >> like.
> > > > >> >> > > > * If it returns false, it means the commit failed with
> > > > non-fatal
> > > > >> >> error
> > > > >> >> > > *AND
> > > > >> >> > > > the txn has been aborted*. Users do not need to call
> > abortTxn
> > > > >> again.
> > > > >> >> > > > * If it throws, then it means fatal errors. Users should
> > shut
> > > > >> down
> > > > >> >> the
> > > > >> >> > > > client.
> > > > >> >> > > >
> > > > >> >> > > > In this case, the pattern becomes:
> > > > >> >> > > >
> > > > >> >> > > > try {
> > > > >> >> > > >    beginTxn()
> > > > >> >> > > >    // do something
> > > > >> >> > > > } catch (Exception) {
> > > > >> >> > > >    shouldCommit = false;
> > > > >> >> > > > }
> > > > >> >> > > >
> > > > >> >> > > > try {
> > > > >> >> > > >     if (shouldCommit) {
> > > > >> >> > > >         commitSucceeded = commitTxn()
> > > > >> >> > > >     } else {
> > > > >> >> > > >         // reset offsets, rollback operations, etc
> > > > >> >> > > >         abortTxn()
> > > > >> >> > > >     }
> > > > >> >> > > > } catch (KafkaException) {
> > > > >> >> > > >     // only throw fatal
> > > > >> >> > > >     shutdown()
> > > > >> >> > > > }
> > > > >> >> > > >
> > > > >> >> > > > if (commitSucceeded)
> > > > >> >> > > >    // do other non-rollbackable things
> > > > >> >> > > > else
> > > > >> >> > > >    // reset offsets, rollback operations, etc
> > > > >> >> > > >
> > > > >> >> > > > -------------------------
> > > > >> >> > > >
> > > > >> >> > > > Of course, if we want to have better visibility into what
> > > > caused
> > > > >> the
> > > > >> >> > > commit
> > > > >> >> > > > to fail and txn to abort. We can let the return type be
> an
> > enum
> > > > >> >> instead
> > > > >> >> > > of
> > > > >> >> > > > a flag. But my main idea is to still let the commitTxn be
> > the
> > > > >> final
> > > > >> >> > point
> > > > >> >> > > > users can tell whether this txn succeeded or not, instead
> > of
> > > > >> >> relying on
> > > > >> >> > > the
> > > > >> >> > > > next beginTxn() call.
> > > > >> >> > > >
> > > > >> >> > > > I agree that adding a boolean flag is indeed useful in
> this
> > > > case.
> > > > >> >> Will
> > > > >> >> > > update the KIP.
> > > > >> >> > >
> > > > >> >> > > 1. Re: "while maintaining the behavior to throw fatal
> > exception
> > > > in
> > > > >> >> raw"
> > > > >> >> > not
> > > > >> >> > > > sure what do you mean by "throw" here. Are you proposing
> > the
> > > > >> >> callback
> > > > >> >> > > would
> > > > >> >> > > > *pass* (not throw) in any fatal exceptions when triggered
> > > > without
> > > > >> >> > > wrapping?
> > > > >> >> > > >
> > > > >> >> > > > Yes, I want to say *pass*, the benefit is to make the end
> > > > user's
> > > > >> >> > > expectation consistent
> > > > >> >> > > regarding exception handling.
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > > 2. I'm not sure I understand the claim regarding the
> > callback
> > > > >> that
> > > > >> >> "In
> > > > >> >> > > EOS
> > > > >> >> > > > setup, it is not required to handle the exception". Are
> you
> > > > >> >> proposing
> > > > >> >> > > that,
> > > > >> >> > > > e.g. in Streams, we do not try to handle any exceptions
> if
> > EOS
> > > > is
> > > > >> >> > enabled
> > > > >> >> > > > in the callback anymore, but just let commitTxn() itself
> to
> > > > fail
> > > > >> to
> > > > >> >> be
> > > > >> >> > > > notified about the problem? Personally I think in Streams
> > we
> > > > >> should
> > > > >> >> > just
> > > > >> >> > > > make the handling logic of the callback to be consistent
> > > > >> regardless
> > > > >> >> of
> > > > >> >> > > the
> > > > >> >> > > > EOS settings (today we have different logic depending on
> > this
> > > > >> logic,
> > > > >> >> > > which
> > > > >> >> > > > I think could be unified as well).
> > > > >> >> > > >
> > > > >> >> > > > My idea originates from the claim on send API:
> > > > >> >> > > "When used as part of a transaction, it is not necessary to
> > > > define
> > > > >> a
> > > > >> >> > > callback or check the result of the future  in order to
> > detect
> > > > >> errors
> > > > >> >> > from
> > > > >> >> > > <code>send</code>. "
> > > > >> >> > > My understanding is that for EOS, the exception will be
> > detected
> > > > by
> > > > >> >> > calling
> > > > >> >> > > transactional APIs either way, so it is a duplicate
> handling
> > to
> > > > >> track
> > > > >> >> > > all the sendExceptions in RecordCollector. However, I
> looked
> > up
> > > > >> >> > > sendException is being used today as follow:
> > > > >> >> > >
> > > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any default or
> > > > >> customized
> > > > >> >> > > exception handler to handle
> > > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > > >> >> > > 3. Check and rethrow exceptions in close(), flush() or new
> > send()
> > > > >> >> calls
> > > > >> >> > >
> > > > >> >> > > To my understanding, we should still honor the commitment
> to
> > #1
> > > > for
> > > > >> >> any
> > > > >> >> > > user customized implementation. The #2 does not really
> > affect our
> > > > >> >> > decision
> > > > >> >> > > upon EOS. The #3 here is still valuable as it could help us
> > fail
> > > > >> fast
> > > > >> >> in
> > > > >> >> > > new send() instead of waiting to later stage of processing.
> > In
> > > > that
> > > > >> >> > sense,
> > > > >> >> > > I agree we should continue to record send exceptions even
> > under
> > > > EOS
> > > > >> >> case
> > > > >> >> > to
> > > > >> >> > > ensure the strength of stream side Producer logic. On the
> > safer
> > > > >> side,
> > > > >> >> we
> > > > >> >> > no
> > > > >> >> > > longer need to wrap certain fatal exceptions like
> > ProducerFenced
> > > > as
> > > > >> >> > > TaskMigrated, since they should not crash the stream thread
> > if
> > > > >> thrown
> > > > >> >> in
> > > > >> >> > > raw format, once we adopt the new processing model in the
> > send
> > > > >> phase.
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > >
> > > > >> >> > > > Guozhang
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > > >> >> > reluctanthero104@gmail.com>
> > > > >> >> > > > wrote:
> > > > >> >> > > >
> > > > >> >> > > > > Thanks for everyone's feedback so far. I have polished
> > the
> > > > KIP
> > > > >> >> after
> > > > >> >> > > > > offline discussion with some folks working on EOS to
> > make the
> > > > >> >> > exception
> > > > >> >> > > > > handling more lightweight. The essential change is that
> > we
> > > > are
> > > > >> not
> > > > >> >> > > > > inventing a new intermediate exception type, but
> instead
> > > > >> >> separating a
> > > > >> >> > > > full
> > > > >> >> > > > > transaction into two phases:
> > > > >> >> > > > >
> > > > >> >> > > > > 1. The data transmission phase
> > > > >> >> > > > > 2. The commit phase
> > > > >> >> > > > >
> > > > >> >> > > > > This way for any exception thrown from phase 1, will be
> > an
> > > > >> >> indicator
> > > > >> >> > in
> > > > >> >> > > > > phase 2 whether we should do commit or abort, and from
> > now on
> > > > >> >> > > > > `commitTransaction` should only throw fatal exceptions,
> > > > >> similar to
> > > > >> >> > > > > `abortTransaction`, so that any KafkaException caught
> in
> > the
> > > > >> >> commit
> > > > >> >> > > phase
> > > > >> >> > > > > will be definitely fatal to crash the app. For more
> > advanced
> > > > >> users
> > > > >> >> > such
> > > > >> >> > > > as
> > > > >> >> > > > > Streams, we have the ability to further wrap selected
> > types
> > > > of
> > > > >> >> fatal
> > > > >> >> > > > > exceptions to trigger task migration if necessary.
> > > > >> >> > > > >
> > > > >> >> > > > > More details in the KIP, feel free to take another
> look,
> > > > >> thanks!
> > > > >> >> > > > >
> > > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > > >> >> > > reluctanthero104@gmail.com>
> > > > >> >> > > > > wrote:
> > > > >> >> > > > >
> > > > >> >> > > > > > Thanks Bruno for the feedback.
> > > > >> >> > > > > >
> > > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> > > > >> >> bruno@confluent.io>
> > > > >> >> > > > wrote:
> > > > >> >> > > > > >
> > > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> Like Matthias, I do also not know the producer
> > internal
> > > > well
> > > > >> >> > enough
> > > > >> >> > > to
> > > > >> >> > > > > >> comment on the categorization. However, I think
> > having a
> > > > >> super
> > > > >> >> > > > exception
> > > > >> >> > > > > >> (e.g. RetriableException) that  encodes if an
> > exception is
> > > > >> >> fatal
> > > > >> >> > or
> > > > >> >> > > > not
> > > > >> >> > > > > >> is cleaner, better understandable and less
> > error-prone,
> > > > >> because
> > > > >> >> > > > ideally
> > > > >> >> > > > > >> when you add a new non-fatal exception in future you
> > just
> > > > >> need
> > > > >> >> to
> > > > >> >> > > > think
> > > > >> >> > > > > >> about letting it inherit from the super exception
> and
> > all
> > > > >> the
> > > > >> >> rest
> > > > >> >> > > of
> > > > >> >> > > > > >> the code will just behave correctly without the need
> > to
> > > > wrap
> > > > >> >> the
> > > > >> >> > new
> > > > >> >> > > > > >> exception into another exception each time it is
> > thrown
> > > > >> (maybe
> > > > >> >> it
> > > > >> >> > is
> > > > >> >> > > > > >> thrown at different location in the code).
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> As far as I understand the following statement from
> > your
> > > > >> >> previous
> > > > >> >> > > > e-mail
> > > > >> >> > > > > >> is the reason that currently such a super exception
> > is not
> > > > >> >> > possible:
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> "Right now we have RetriableException type, if we
> are
> > > > going
> > > > >> to
> > > > >> >> > add a
> > > > >> >> > > > > >> `ProducerRetriableException` type, we have to put
> > this new
> > > > >> >> > interface
> > > > >> >> > > > as
> > > > >> >> > > > > >> the parent of the RetriableException, because not
> all
> > > > thrown
> > > > >> >> > > non-fatal
> > > > >> >> > > > > >> exceptions are `retriable` in general for producer"
> > > > >> >> > > > > >>
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> In the list of exceptions in your KIP, I found
> > non-fatal
> > > > >> >> > exceptions
> > > > >> >> > > > that
> > > > >> >> > > > > >> do not inherit from RetriableException. I guess
> those
> > are
> > > > >> the
> > > > >> >> ones
> > > > >> >> > > you
> > > > >> >> > > > > >> are referring to in your statement:
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> InvalidProducerEpochException
> > > > >> >> > > > > >> InvalidPidMappingException
> > > > >> >> > > > > >> TransactionAbortedException
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> All of those exceptions are non-fatal and do not
> > inherit
> > > > >> from
> > > > >> >> > > > > >> RetriableException. Is there a reason for that? If
> > they
> > > > >> >> depended
> > > > >> >> > > from
> > > > >> >> > > > > >> RetriableException we would be a bit closer to a
> super
> > > > >> >> exception I
> > > > >> >> > > > > >> mention above.
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> The reason is that sender may catch those exceptions
> > in
> > > > the
> > > > >> >> > > > > > ProduceResponse, and it currently does infinite
> > > > >> >> > > > > > retries on RetriableException. To make sure we could
> > > > trigger
> > > > >> the
> > > > >> >> > > > > > abortTransaction() by catching non-fatal thrown
> > > > >> >> > > > > > exceptions and reinitialize the txn state, we chose
> > not to
> > > > >> let
> > > > >> >> > those
> > > > >> >> > > > > > exceptions inherit RetriableException so that
> > > > >> >> > > > > > they won't cause infinite retry on sender.
> > > > >> >> > > > > >
> > > > >> >> > > > > >
> > > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > > >> >> UnknownProducerIdException, I
> > > > >> >> > > > think
> > > > >> >> > > > > >> to understand that their fatality depends on the
> type
> > > > (i.e.
> > > > >> >> > > > > >> configuration) of the producer. That makes it
> > difficult to
> > > > >> >> have a
> > > > >> >> > > > super
> > > > >> >> > > > > >> exception that encodes the retriability as mentioned
> > > > above.
> > > > >> >> Would
> > > > >> >> > it
> > > > >> >> > > > be
> > > > >> >> > > > > >> possible to introduce exceptions that inherit from
> > > > >> >> > > RetriableException
> > > > >> >> > > > > >> and that are thrown when those exceptions are caught
> > from
> > > > >> the
> > > > >> >> > > brokers
> > > > >> >> > > > > >> and the type of the producer is such that the
> > exceptions
> > > > are
> > > > >> >> > > > retriable?
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> Yea, I think in general the exception type mixing is
> > > > >> difficult
> > > > >> >> to
> > > > >> >> > > get
> > > > >> >> > > > > > them all right. I have already proposed another
> > solution
> > > > >> based
> > > > >> >> on
> > > > >> >> > my
> > > > >> >> > > > > > offline discussion with some folks working on EOS
> > > > >> >> > > > > > to make the handling more straightforward for end
> users
> > > > >> without
> > > > >> >> the
> > > > >> >> > > > need
> > > > >> >> > > > > > to distinguish exception fatality.
> > > > >> >> > > > > >
> > > > >> >> > > > > >> Best,
> > > > >> >> > > > > >> Bruno
> > > > >> >> > > > > >>
> > > > >> >> > > > > >>
> > > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a pass over
> > the
> > > > >> list
> > > > >> >> and
> > > > >> >> > > here
> > > > >> >> > > > > are
> > > > >> >> > > > > >> > some thoughts:
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > 0) Although this is not part of the public API, I
> > think
> > > > we
> > > > >> >> > should
> > > > >> >> > > > make
> > > > >> >> > > > > >> sure
> > > > >> >> > > > > >> > that the suggested pattern (i.e. user can always
> > call
> > > > >> >> abortTxn()
> > > > >> >> > > > when
> > > > >> >> > > > > >> > handling non-fatal errors) are indeed supported.
> > E.g. if
> > > > >> the
> > > > >> >> txn
> > > > >> >> > > is
> > > > >> >> > > > > >> already
> > > > >> >> > > > > >> > aborted by the broker side, then users can still
> > call
> > > > >> >> abortTxn
> > > > >> >> > > which
> > > > >> >> > > > > >> would
> > > > >> >> > > > > >> > not throw another exception but just be treated
> as a
> > > > >> no-op.
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think this
> > error
> > > > >> can
> > > > >> >> > also
> > > > >> >> > > be
> > > > >> >> > > > > >> > returned but not documented yet. This should be a
> > > > >> non-fatal
> > > > >> >> > error.
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is
> > returned
> > > > from
> > > > >> >> > broker
> > > > >> >> > > > when
> > > > >> >> > > > > >> txn
> > > > >> >> > > > > >> > state transition failed (e.g. it is trying to
> > transit to
> > > > >> >> > > > > complete-commit
> > > > >> >> > > > > >> > while the current state is not prepare-commit).
> This
> > > > error
> > > > >> >> could
> > > > >> >> > > > > >> indicates
> > > > >> >> > > > > >> > a bug on the client internal code or the broker
> > code,
> > > > OR a
> > > > >> >> user
> > > > >> >> > > > error
> > > > >> >> > > > > >> --- a
> > > > >> >> > > > > >> > similar error is ConcurrentTransactionsException,
> > i.e.
> > > > if
> > > > >> >> Kafka
> > > > >> >> > is
> > > > >> >> > > > > >> bug-free
> > > > >> >> > > > > >> > these exceptions would only be returned if users
> > try to
> > > > do
> > > > >> >> > > something
> > > > >> >> > > > > >> wrong,
> > > > >> >> > > > > >> > e.g. calling abortTxn right after a commitTxn,
> etc.
> > So
> > > > I'm
> > > > >> >> > > thinking
> > > > >> >> > > > it
> > > > >> >> > > > > >> > should be a non-fatal error instead of a fatal
> > error,
> > > > >> wdyt?
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> > > > transactional
> > > > >> >> > > sequence
> > > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> > > > >> >> > > > *OutOfSequenceException*
> > > > >> >> > > > > >> that
> > > > >> >> > > > > >> > is treated as non-fatal. I guess your proposal is
> > that
> > > > >> >> > > > > >> > OutOfOrderSequenceException would be treated
> either
> > as
> > > > >> fatal
> > > > >> >> > with
> > > > >> >> > > > > >> > transactional producer, or non-fatal with
> idempotent
> > > > >> >> producer,
> > > > >> >> > is
> > > > >> >> > > > that
> > > > >> >> > > > > >> > right? If the producer is only configured with
> > > > idempotency
> > > > >> >> but
> > > > >> >> > not
> > > > >> >> > > > > >> > transaction, then throwing a
> > > > >> >> TransactionStateCorruptedException
> > > > >> >> > > for
> > > > >> >> > > > > >> > non-fatal errors would be confusing since users
> are
> > not
> > > > >> using
> > > > >> >> > > > > >> transactions
> > > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > > >> OutOfSequenceException
> > > > >> >> > as-is
> > > > >> >> > > > > (i.e.
> > > > >> >> > > > > >> > not wrapped) no matter how the producer is
> > configured,
> > > > and
> > > > >> >> let
> > > > >> >> > the
> > > > >> >> > > > > >> caller
> > > > >> >> > > > > >> > decide how to handle it based on whether it is
> only
> > > > >> >> idempotent
> > > > >> >> > or
> > > > >> >> > > > > >> > transactional itself.
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()`
> callback /
> > > > >> future
> > > > >> >> can
> > > > >> >> > > also
> > > > >> >> > > > > >> throw
> > > > >> >> > > > > >> > txn-related exceptions, I think this KIP should
> also
> > > > cover
> > > > >> >> this
> > > > >> >> > > API
> > > > >> >> > > > as
> > > > >> >> > > > > >> well?
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > 5) This is related to 1/2) above: sometimes those
> > > > >> non-fatal
> > > > >> >> > errors
> > > > >> >> > > > > like
> > > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to
> the
> > > > state
> > > > >> >> being
> > > > >> >> > > > > >> corrupted
> > > > >> >> > > > > >> > at the broker side, but maybe users are doing
> > something
> > > > >> >> wrong.
> > > > >> >> > So
> > > > >> >> > > > I'm
> > > > >> >> > > > > >> > wondering if we should further distinguish those
> > > > non-fatal
> > > > >> >> > errors
> > > > >> >> > > > > >> between
> > > > >> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a
> > broker
> > > > >> timed
> > > > >> >> > out
> > > > >> >> > > > and
> > > > >> >> > > > > >> > aborted a txn and later an endTxn request is
> > received,
> > > > >> and b)
> > > > >> >> > the
> > > > >> >> > > > > user's
> > > > >> >> > > > > >> > API call pattern is incorrect, causing the request
> > to be
> > > > >> >> > rejected
> > > > >> >> > > > with
> > > > >> >> > > > > >> an
> > > > >> >> > > > > >> > error code from the broker.
> > > > >> >> *TransactionStateCorruptedException*
> > > > >> >> > > > feels
> > > > >> >> > > > > >> to
> > > > >> >> > > > > >> > me more like for case a), but not case b).
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > Guozhang
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > >> >> > > > > reluctanthero104@gmail.com>
> > > > >> >> > > > > >> > wrote:
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> >> Thanks Matthias, I think your proposal makes
> sense
> > as
> > > > >> well,
> > > > >> >> on
> > > > >> >> > > the
> > > > >> >> > > > > pro
> > > > >> >> > > > > >> side
> > > > >> >> > > > > >> >> we could have a universally agreed exception type
> > to be
> > > > >> >> caught
> > > > >> >> > by
> > > > >> >> > > > the
> > > > >> >> > > > > >> user,
> > > > >> >> > > > > >> >> without having an extra layer on top of the
> actual
> > > > >> >> exceptions.
> > > > >> >> > I
> > > > >> >> > > > > could
> > > > >> >> > > > > >> see
> > > > >> >> > > > > >> >> some issue on downsides:
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >> 1. The exception hierarchy will be more complex.
> > Right
> > > > >> now
> > > > >> >> we
> > > > >> >> > > have
> > > > >> >> > > > > >> >> RetriableException type, if we are going to add a
> > > > >> >> > > > > >> >> `ProducerRetriableException` type, we have to put
> > this
> > > > >> new
> > > > >> >> > > > interface
> > > > >> >> > > > > >> as the
> > > > >> >> > > > > >> >> parent of the RetriableException, because not all
> > > > thrown
> > > > >> >> > > non-fatal
> > > > >> >> > > > > >> >> exceptions are `retriable` in general for
> > producer, for
> > > > >> >> example
> > > > >> >> > > > > >> >> <
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >>
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > >> >> > > > > >> >>> .
> > > > >> >> > > > > >> >> We could have an empty interface
> > > > >> >> `ProducerRetriableException`
> > > > >> >> > to
> > > > >> >> > > > let
> > > > >> >> > > > > >> all
> > > > >> >> > > > > >> >> the thrown exceptions implement for sure, but
> it's
> > a
> > > > bit
> > > > >> >> > > unorthodox
> > > > >> >> > > > > in
> > > > >> >> > > > > >> the
> > > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >> 2. There are cases where we throw a
> KafkaException
> > > > >> wrapping
> > > > >> >> > > another
> > > > >> >> > > > > >> >> KafkaException as either fatal or non-fatal. If
> we
> > use
> > > > an
> > > > >> >> > > interface
> > > > >> >> > > > > to
> > > > >> >> > > > > >> >> solve #1, it is also required to implement
> another
> > > > >> bloated
> > > > >> >> > > > exception
> > > > >> >> > > > > >> class
> > > > >> >> > > > > >> >> which could replace KafkaException type, as we
> > couldn't
> > > > >> mark
> > > > >> >> > > > > >> KafkaException
> > > > >> >> > > > > >> >> as retriable for sure.
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping means
> we
> > > > could
> > > > >> >> limit
> > > > >> >> > > the
> > > > >> >> > > > > >> scope
> > > > >> >> > > > > >> >> of affection to the producer only, which is
> > important
> > > > >> since
> > > > >> >> we
> > > > >> >> > > > don't
> > > > >> >> > > > > >> want
> > > > >> >> > > > > >> >> shared exception types to implement a
> > producer-related
> > > > >> >> > interface,
> > > > >> >> > > > > such
> > > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >> Best,
> > > > >> >> > > > > >> >> Boyang
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> > > > >> >> > mjsax@apache.org
> > > > >> >> > > >
> > > > >> >> > > > > >> wrote:
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>> Overall, categorizing exceptions makes a lot of
> > sense.
> > > > >> As I
> > > > >> >> > > don't
> > > > >> >> > > > > know
> > > > >> >> > > > > >> >>> the producer internals well enough, I cannot
> > comment
> > > > on
> > > > >> the
> > > > >> >> > > > > >> >>> categorization in detail though.
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>> What I am wondering is, if we should introduce
> an
> > > > >> exception
> > > > >> >> > > > > interface
> > > > >> >> > > > > >> >>> that non-fatal exception would implement instead
> > of
> > > > >> >> creating a
> > > > >> >> > > new
> > > > >> >> > > > > >> class
> > > > >> >> > > > > >> >>> that will wrap non-fatal exceptions? What would
> > be the
> > > > >> >> > pros/cons
> > > > >> >> > > > for
> > > > >> >> > > > > >> >>> both designs?
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>> -Matthias
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > >> >> > > > > >> >>>> Hey there,
> > > > >> >> > > > > >> >>>>
> > > > >> >> > > > > >> >>>> I would like to start a discussion thread for
> > > > KIP-691:
> > > > >> >> > > > > >> >>>>
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >>
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >>
> > > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > >> >> > > > > >> >>>>
> > > > >> >> > > > > >> >>>> The KIP is aiming to simplify the exception
> > handling
> > > > >> logic
> > > > >> >> > for
> > > > >> >> > > > > >> >>>> transactional Producer users by classifying
> > fatal and
> > > > >> >> > non-fatal
> > > > >> >> > > > > >> >>> exceptions
> > > > >> >> > > > > >> >>>> and throw them correspondingly for easier catch
> > and
> > > > >> retry.
> > > > >> >> > Let
> > > > >> >> > > me
> > > > >> >> > > > > >> know
> > > > >> >> > > > > >> >>> what
> > > > >> >> > > > > >> >>>> you think.
> > > > >> >> > > > > >> >>>>
> > > > >> >> > > > > >> >>>> Best,
> > > > >> >> > > > > >> >>>> Boyang
> > > > >> >> > > > > >> >>>>
> > > > >> >> > > > > >> >>>
> > > > >> >> > > > > >> >>
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >> >
> > > > >> >> > > > > >>
> > > > >> >> > > > > >
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > --
> > > > >> >> > > > -- Guozhang
> > > > >> >> > > >
> > > > >> >> > >
> > > > >> >> >
> > > > >> >> >
> > > > >> >> > --
> > > > >> >> > -- Guozhang
> > > > >> >> >
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > -- Guozhang
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> -- Guozhang
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Boyang,

Thanks for the iterations here. I think this is something we should have
done a long time ago. It sounds like there are two API changes here:

1. We are introducing the `CommitFailedException` to wrap abortable
errors that are raised from `commitTransaction`. This sounds fine to me. As
far as I know, the only case we might need this is when we add support to
let producers recover from coordinator timeouts. Are there any others?

2. We are wrapping non-fatal errors raised from `send` as `KafkaException`.
The motivation for this is less clear to me and it doesn't look like the
example from the KIP depends on it. My concern here is compatibility.
Currently we have the following documentation for the `Callback` API:

```
     *                  Non-Retriable exceptions (fatal, the message will
never be sent):
     *
     *                  InvalidTopicException
     *                  OffsetMetadataTooLargeException
     *                  RecordBatchTooLargeException
     *                  RecordTooLargeException
     *                  UnknownServerException
     *                  UnknownProducerIdException
     *                  InvalidProducerEpochException
     *
     *                  Retriable exceptions (transient, may be covered by
increasing #.retries):
     *
     *                  CorruptRecordException
     *                  InvalidMetadataException
     *                  NotEnoughReplicasAfterAppendException
     *                  NotEnoughReplicasException
     *                  OffsetOutOfRangeException
     *                  TimeoutException
     *                  UnknownTopicOrPartitionException
```

If we wrap all the retriable exceptions documented here as
`KafkaException`, wouldn't that break any error handling that users might
already have?

Thanks,
Jason


On Sat, Jan 23, 2021 at 3:31 AM Hiringuru <in...@hiringuru.com> wrote:

> Why  we are receiving all emails kindly remove us from
> dev@kafka.apache.org we don't want to receive emails anymore.
>
> Thanks
> > On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com> wrote:
> >
> >
> > Thanks Boyang, yes I think I was confused about the different handling of
> > two abortTxn calls, and now I get it was not intentional. I think I do
> not
> > have more concerns.
> >
> > On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Thanks for the clarification Guozhang, I got your point that we want to
> > > have a consistent handling of fatal exceptions being thrown from the
> > > abortTxn. I modified the current template to move the fatal exception
> > > try-catch outside of the processing loop to make sure we could get a
> chance
> > > to close consumer/producer modules. Let me know what you think.
> > >
> > > Best,
> > > Boyang
> > >
> > > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > My understanding is that abortTransaction would only throw when the
> > > > producer is in fatal state. For CommitFailed, the producer should
> still
> > > be
> > > > in the abortable error state, so that abortTransaction call would not
> > > throw.
> > > >
> > > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > >> Or are you going to maintain some internal state such that, the
> > > >> `abortTransaction` in the catch block would never throw again?
> > > >>
> > > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <wa...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Hi Boyang/Jason,
> > > >> >
> > > >> > I've also thought about this (i.e. using CommitFailed for all
> > > >> non-fatal),
> > > >> > but what I'm pondering is that, in the catch (CommitFailed) block,
> > > what
> > > >> > would happen if the `producer.abortTransaction();` throws again?
> > > should
> > > >> > that be captured as a fatal and cause the client to close again.
> > > >> >
> > > >> > If yes, then naively the pattern would be:
> > > >> >
> > > >> > ...
> > > >> > catch (CommitFailedException e) {
> > > >> >         // Transaction commit failed with abortable error, user
> could
> > > >> reset
> > > >> >         // the application state and resume with a new
> transaction.
> > > The
> > > >> > root
> > > >> >         // cause was wrapped in the thrown exception.
> > > >> >         resetToLastCommittedPositions(consumer);
> > > >> >         try {
> > > >> >             producer.abortTransaction();
> > > >> >         } catch (KafkaException e) {
> > > >> >             producer.close();
> > > >> >             consumer.close();
> > > >> >             throw e;
> > > >> >         }
> > > >> >     } catch (KafkaException e) {
> > > >> >         producer.close();
> > > >> >         consumer.close();
> > > >> >         throw e;
> > > >> >     }
> > > >> > ...
> > > >> >
> > > >> > Guozhang
> > > >> >
> > > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > > >> reluctanthero104@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> >> Hey Guozhang,
> > > >> >>
> > > >> >> Jason and I were discussing the new API offline and decided to
> take
> > > >> >> another
> > > >> >> approach. Firstly, the reason not to invent a new API with
> returned
> > > >> >> boolean
> > > >> >> flag is for compatibility consideration, since old EOS code
> would not
> > > >> know
> > > >> >> that a given transaction commit was failed internally as they
> don't
> > > >> listen
> > > >> >> to the output flag. Our proposed alternative solution is to let
> > > >> >> *commitTransaction
> > > >> >> throw CommitFailedException whenever the commit failed with
> non-fatal
> > > >> >> exception*, so that on the caller side the handling logic
> becomes:
> > > >> >>
> > > >> >> try {
> > > >> >>         if (shouldCommit) {
> > > >> >>             producer.commitTransaction();
> > > >> >>         } else {
> > > >> >>             resetToLastCommittedPositions(consumer);
> > > >> >>             producer.abortTransaction();
> > > >> >>         }
> > > >> >>     } catch (CommitFailedException e) {
> > > >> >>         // Transaction commit failed with abortable error, user
> could
> > > >> >> reset
> > > >> >>         // the application state and resume with a new
> transaction.
> > > The
> > > >> >> root
> > > >> >>         // cause was wrapped in the thrown exception.
> > > >> >>         resetToLastCommittedPositions(consumer);
> > > >> >>         producer.abortTransaction();
> > > >> >>     } catch (KafkaException e) {
> > > >> >>         producer.close();
> > > >> >>         consumer.close();
> > > >> >>         throw e;
> > > >> >>     }
> > > >> >>
> > > >> >> This approach looks cleaner as all exception types other than
> > > >> CommitFailed
> > > >> >> will doom to be fatal, which is very easy to adopt for users. In
> the
> > > >> >> meantime, we still maintain the commitTxn behavior to throw
> instead
> > > of
> > > >> >> silently failing.
> > > >> >>
> > > >> >> In addition, we decided to drop the recommendation to handle
> > > >> >> TimeoutException and leave it to the users to make the call. The
> > > >> downside
> > > >> >> for blindly calling abortTxn upon timeout is that we could
> result in
> > > an
> > > >> >> illegal state when the commit was already successful on the
> broker
> > > >> >> side. Without a good guarantee on the outcome, overcomplicating
> the
> > > >> >> template should not be encouraged IMHO.
> > > >> >>
> > > >> >> Let me know your thoughts on the new approach here, thank you!
> > > >> >>
> > > >> >> Best,
> > > >> >> Boyang
> > > >> >>
> > > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <
> wangguoz@gmail.com>
> > > >> >> wrote:
> > > >> >>
> > > >> >> > Thanks for your clarification on 2)/3), that makes sense.
> > > >> >> >
> > > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > > >> >> reluctanthero104@gmail.com>
> > > >> >> > wrote:
> > > >> >> >
> > > >> >> > > Thanks for the input Guozhang, replied inline.
> > > >> >> > >
> > > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > >> >> > wrote:
> > > >> >> > >
> > > >> >> > > > Hello Boyang,
> > > >> >> > > >
> > > >> >> > > > Thanks for the updated KIP. I read it again and have the
> > > >> following
> > > >> >> > > > thoughts:
> > > >> >> > > >
> > > >> >> > > > 0. I'm a bit concerned that if commitTxn does not throw any
> > > >> >> non-fatal
> > > >> >> > > > exception, and instead we rely on the subsequent beginTxn
> call
> > > to
> > > >> >> > throw,
> > > >> >> > > it
> > > >> >> > > > may violate some callers with a pattern that relying on
> > > >> commitTxn to
> > > >> >> > > > succeed to make some non-rollback operations. For example:
> > > >> >> > > >
> > > >> >> > > > beginTxn()
> > > >> >> > > > // do some read-write on my local DB
> > > >> >> > > > commitTxn()
> > > >> >> > > > // if commitTxn succeeds, then commit the DB
> > > >> >> > > >
> > > >> >> > > > -------------
> > > >> >> > > >
> > > >> >> > > > The issue is that, committing DB is a non-rollback
> operation,
> > > and
> > > >> >> users
> > > >> >> > > may
> > > >> >> > > > just rely on commitTxn to return without error to make this
> > > >> >> > non-rollback
> > > >> >> > > > call. Of course we can just claim this pattern is not
> > > legitimate
> > > >> >> and is
> > > >> >> > > not
> > > >> >> > > > the right way of doing things, but many users may naturally
> > > adopt
> > > >> >> this
> > > >> >> > > > pattern.
> > > >> >> > > >
> > > >> >> > > > So maybe we should still let commitTxn also throw non-fatal
> > > >> >> exceptions,
> > > >> >> > > in
> > > >> >> > > > which case we would then call abortTxn again.
> > > >> >> > > >
> > > >> >> > > > But if we do this, the pattern becomes:
> > > >> >> > > >
> > > >> >> > > > try {
> > > >> >> > > >    beginTxn()
> > > >> >> > > >    // do something
> > > >> >> > > > } catch (Exception) {
> > > >> >> > > >    shouldCommit = false;
> > > >> >> > > > }
> > > >> >> > > >
> > > >> >> > > > if (shouldCommit) {
> > > >> >> > > >     try {
> > > >> >> > > >         commitTxn()
> > > >> >> > > >     } catch (...) {        // enumerate all fatal
> exceptions
> > > >> >> > > >         shutdown()
> > > >> >> > > >     } catch (KafkaException) {
> > > >> >> > > >         // non-fatal
> > > >> >> > > >         shouldCommit = false;
> > > >> >> > > >     }
> > > >> >> > > > }
> > > >> >> > > >
> > > >> >> > > > if (!shouldCommit && running()) {
> > > >> >> > > > try {
> > > >> >> > > >         abortTxn()
> > > >> >> > > >     } catch (KafkaException) {
> > > >> >> > > >         // only throw fatal
> > > >> >> > > >         shutdown()
> > > >> >> > > >     }
> > > >> >> > > > }
> > > >> >> > > >
> > > >> >> > > > ---------------------
> > > >> >> > > >
> > > >> >> > > > Which is much more complicated.
> > > >> >> > > >
> > > >> >> > > > Thank makes me think, the alternative we have discussed
> offline
> > > >> may
> > > >> >> be
> > > >> >> > > > better: let commitTxn() to return a boolean flag.
> > > >> >> > > >
> > > >> >> > > > * If it returns true, it means the commit succeeded. Users
> can
> > > >> >> > > comfortably
> > > >> >> > > > continue and do any external non-rollback operations if
> they
> > > >> like.
> > > >> >> > > > * If it returns false, it means the commit failed with
> > > non-fatal
> > > >> >> error
> > > >> >> > > *AND
> > > >> >> > > > the txn has been aborted*. Users do not need to call
> abortTxn
> > > >> again.
> > > >> >> > > > * If it throws, then it means fatal errors. Users should
> shut
> > > >> down
> > > >> >> the
> > > >> >> > > > client.
> > > >> >> > > >
> > > >> >> > > > In this case, the pattern becomes:
> > > >> >> > > >
> > > >> >> > > > try {
> > > >> >> > > >    beginTxn()
> > > >> >> > > >    // do something
> > > >> >> > > > } catch (Exception) {
> > > >> >> > > >    shouldCommit = false;
> > > >> >> > > > }
> > > >> >> > > >
> > > >> >> > > > try {
> > > >> >> > > >     if (shouldCommit) {
> > > >> >> > > >         commitSucceeded = commitTxn()
> > > >> >> > > >     } else {
> > > >> >> > > >         // reset offsets, rollback operations, etc
> > > >> >> > > >         abortTxn()
> > > >> >> > > >     }
> > > >> >> > > > } catch (KafkaException) {
> > > >> >> > > >     // only throw fatal
> > > >> >> > > >     shutdown()
> > > >> >> > > > }
> > > >> >> > > >
> > > >> >> > > > if (commitSucceeded)
> > > >> >> > > >    // do other non-rollbackable things
> > > >> >> > > > else
> > > >> >> > > >    // reset offsets, rollback operations, etc
> > > >> >> > > >
> > > >> >> > > > -------------------------
> > > >> >> > > >
> > > >> >> > > > Of course, if we want to have better visibility into what
> > > caused
> > > >> the
> > > >> >> > > commit
> > > >> >> > > > to fail and txn to abort. We can let the return type be an
> enum
> > > >> >> instead
> > > >> >> > > of
> > > >> >> > > > a flag. But my main idea is to still let the commitTxn be
> the
> > > >> final
> > > >> >> > point
> > > >> >> > > > users can tell whether this txn succeeded or not, instead
> of
> > > >> >> relying on
> > > >> >> > > the
> > > >> >> > > > next beginTxn() call.
> > > >> >> > > >
> > > >> >> > > > I agree that adding a boolean flag is indeed useful in this
> > > case.
> > > >> >> Will
> > > >> >> > > update the KIP.
> > > >> >> > >
> > > >> >> > > 1. Re: "while maintaining the behavior to throw fatal
> exception
> > > in
> > > >> >> raw"
> > > >> >> > not
> > > >> >> > > > sure what do you mean by "throw" here. Are you proposing
> the
> > > >> >> callback
> > > >> >> > > would
> > > >> >> > > > *pass* (not throw) in any fatal exceptions when triggered
> > > without
> > > >> >> > > wrapping?
> > > >> >> > > >
> > > >> >> > > > Yes, I want to say *pass*, the benefit is to make the end
> > > user's
> > > >> >> > > expectation consistent
> > > >> >> > > regarding exception handling.
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > > 2. I'm not sure I understand the claim regarding the
> callback
> > > >> that
> > > >> >> "In
> > > >> >> > > EOS
> > > >> >> > > > setup, it is not required to handle the exception". Are you
> > > >> >> proposing
> > > >> >> > > that,
> > > >> >> > > > e.g. in Streams, we do not try to handle any exceptions if
> EOS
> > > is
> > > >> >> > enabled
> > > >> >> > > > in the callback anymore, but just let commitTxn() itself to
> > > fail
> > > >> to
> > > >> >> be
> > > >> >> > > > notified about the problem? Personally I think in Streams
> we
> > > >> should
> > > >> >> > just
> > > >> >> > > > make the handling logic of the callback to be consistent
> > > >> regardless
> > > >> >> of
> > > >> >> > > the
> > > >> >> > > > EOS settings (today we have different logic depending on
> this
> > > >> logic,
> > > >> >> > > which
> > > >> >> > > > I think could be unified as well).
> > > >> >> > > >
> > > >> >> > > > My idea originates from the claim on send API:
> > > >> >> > > "When used as part of a transaction, it is not necessary to
> > > define
> > > >> a
> > > >> >> > > callback or check the result of the future  in order to
> detect
> > > >> errors
> > > >> >> > from
> > > >> >> > > <code>send</code>. "
> > > >> >> > > My understanding is that for EOS, the exception will be
> detected
> > > by
> > > >> >> > calling
> > > >> >> > > transactional APIs either way, so it is a duplicate handling
> to
> > > >> track
> > > >> >> > > all the sendExceptions in RecordCollector. However, I looked
> up
> > > >> >> > > sendException is being used today as follow:
> > > >> >> > >
> > > >> >> > > 1. Pass to "ProductionExceptionHandler" for any default or
> > > >> customized
> > > >> >> > > exception handler to handle
> > > >> >> > > 2. Stop collecting offset info or new exceptions
> > > >> >> > > 3. Check and rethrow exceptions in close(), flush() or new
> send()
> > > >> >> calls
> > > >> >> > >
> > > >> >> > > To my understanding, we should still honor the commitment to
> #1
> > > for
> > > >> >> any
> > > >> >> > > user customized implementation. The #2 does not really
> affect our
> > > >> >> > decision
> > > >> >> > > upon EOS. The #3 here is still valuable as it could help us
> fail
> > > >> fast
> > > >> >> in
> > > >> >> > > new send() instead of waiting to later stage of processing.
> In
> > > that
> > > >> >> > sense,
> > > >> >> > > I agree we should continue to record send exceptions even
> under
> > > EOS
> > > >> >> case
> > > >> >> > to
> > > >> >> > > ensure the strength of stream side Producer logic. On the
> safer
> > > >> side,
> > > >> >> we
> > > >> >> > no
> > > >> >> > > longer need to wrap certain fatal exceptions like
> ProducerFenced
> > > as
> > > >> >> > > TaskMigrated, since they should not crash the stream thread
> if
> > > >> thrown
> > > >> >> in
> > > >> >> > > raw format, once we adopt the new processing model in the
> send
> > > >> phase.
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > >
> > > >> >> > > > Guozhang
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > > >> >> > reluctanthero104@gmail.com>
> > > >> >> > > > wrote:
> > > >> >> > > >
> > > >> >> > > > > Thanks for everyone's feedback so far. I have polished
> the
> > > KIP
> > > >> >> after
> > > >> >> > > > > offline discussion with some folks working on EOS to
> make the
> > > >> >> > exception
> > > >> >> > > > > handling more lightweight. The essential change is that
> we
> > > are
> > > >> not
> > > >> >> > > > > inventing a new intermediate exception type, but instead
> > > >> >> separating a
> > > >> >> > > > full
> > > >> >> > > > > transaction into two phases:
> > > >> >> > > > >
> > > >> >> > > > > 1. The data transmission phase
> > > >> >> > > > > 2. The commit phase
> > > >> >> > > > >
> > > >> >> > > > > This way for any exception thrown from phase 1, will be
> an
> > > >> >> indicator
> > > >> >> > in
> > > >> >> > > > > phase 2 whether we should do commit or abort, and from
> now on
> > > >> >> > > > > `commitTransaction` should only throw fatal exceptions,
> > > >> similar to
> > > >> >> > > > > `abortTransaction`, so that any KafkaException caught in
> the
> > > >> >> commit
> > > >> >> > > phase
> > > >> >> > > > > will be definitely fatal to crash the app. For more
> advanced
> > > >> users
> > > >> >> > such
> > > >> >> > > > as
> > > >> >> > > > > Streams, we have the ability to further wrap selected
> types
> > > of
> > > >> >> fatal
> > > >> >> > > > > exceptions to trigger task migration if necessary.
> > > >> >> > > > >
> > > >> >> > > > > More details in the KIP, feel free to take another look,
> > > >> thanks!
> > > >> >> > > > >
> > > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > >> >> > > reluctanthero104@gmail.com>
> > > >> >> > > > > wrote:
> > > >> >> > > > >
> > > >> >> > > > > > Thanks Bruno for the feedback.
> > > >> >> > > > > >
> > > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> > > >> >> bruno@confluent.io>
> > > >> >> > > > wrote:
> > > >> >> > > > > >
> > > >> >> > > > > >> Thanks Boyang for the KIP!
> > > >> >> > > > > >>
> > > >> >> > > > > >> Like Matthias, I do also not know the producer
> internal
> > > well
> > > >> >> > enough
> > > >> >> > > to
> > > >> >> > > > > >> comment on the categorization. However, I think
> having a
> > > >> super
> > > >> >> > > > exception
> > > >> >> > > > > >> (e.g. RetriableException) that  encodes if an
> exception is
> > > >> >> fatal
> > > >> >> > or
> > > >> >> > > > not
> > > >> >> > > > > >> is cleaner, better understandable and less
> error-prone,
> > > >> because
> > > >> >> > > > ideally
> > > >> >> > > > > >> when you add a new non-fatal exception in future you
> just
> > > >> need
> > > >> >> to
> > > >> >> > > > think
> > > >> >> > > > > >> about letting it inherit from the super exception and
> all
> > > >> the
> > > >> >> rest
> > > >> >> > > of
> > > >> >> > > > > >> the code will just behave correctly without the need
> to
> > > wrap
> > > >> >> the
> > > >> >> > new
> > > >> >> > > > > >> exception into another exception each time it is
> thrown
> > > >> (maybe
> > > >> >> it
> > > >> >> > is
> > > >> >> > > > > >> thrown at different location in the code).
> > > >> >> > > > > >>
> > > >> >> > > > > >> As far as I understand the following statement from
> your
> > > >> >> previous
> > > >> >> > > > e-mail
> > > >> >> > > > > >> is the reason that currently such a super exception
> is not
> > > >> >> > possible:
> > > >> >> > > > > >>
> > > >> >> > > > > >> "Right now we have RetriableException type, if we are
> > > going
> > > >> to
> > > >> >> > add a
> > > >> >> > > > > >> `ProducerRetriableException` type, we have to put
> this new
> > > >> >> > interface
> > > >> >> > > > as
> > > >> >> > > > > >> the parent of the RetriableException, because not all
> > > thrown
> > > >> >> > > non-fatal
> > > >> >> > > > > >> exceptions are `retriable` in general for producer"
> > > >> >> > > > > >>
> > > >> >> > > > > >>
> > > >> >> > > > > >> In the list of exceptions in your KIP, I found
> non-fatal
> > > >> >> > exceptions
> > > >> >> > > > that
> > > >> >> > > > > >> do not inherit from RetriableException. I guess those
> are
> > > >> the
> > > >> >> ones
> > > >> >> > > you
> > > >> >> > > > > >> are referring to in your statement:
> > > >> >> > > > > >>
> > > >> >> > > > > >> InvalidProducerEpochException
> > > >> >> > > > > >> InvalidPidMappingException
> > > >> >> > > > > >> TransactionAbortedException
> > > >> >> > > > > >>
> > > >> >> > > > > >> All of those exceptions are non-fatal and do not
> inherit
> > > >> from
> > > >> >> > > > > >> RetriableException. Is there a reason for that? If
> they
> > > >> >> depended
> > > >> >> > > from
> > > >> >> > > > > >> RetriableException we would be a bit closer to a super
> > > >> >> exception I
> > > >> >> > > > > >> mention above.
> > > >> >> > > > > >>
> > > >> >> > > > > >> The reason is that sender may catch those exceptions
> in
> > > the
> > > >> >> > > > > > ProduceResponse, and it currently does infinite
> > > >> >> > > > > > retries on RetriableException. To make sure we could
> > > trigger
> > > >> the
> > > >> >> > > > > > abortTransaction() by catching non-fatal thrown
> > > >> >> > > > > > exceptions and reinitialize the txn state, we chose
> not to
> > > >> let
> > > >> >> > those
> > > >> >> > > > > > exceptions inherit RetriableException so that
> > > >> >> > > > > > they won't cause infinite retry on sender.
> > > >> >> > > > > >
> > > >> >> > > > > >
> > > >> >> > > > > >> With OutOfOrderSequenceException and
> > > >> >> UnknownProducerIdException, I
> > > >> >> > > > think
> > > >> >> > > > > >> to understand that their fatality depends on the type
> > > (i.e.
> > > >> >> > > > > >> configuration) of the producer. That makes it
> difficult to
> > > >> >> have a
> > > >> >> > > > super
> > > >> >> > > > > >> exception that encodes the retriability as mentioned
> > > above.
> > > >> >> Would
> > > >> >> > it
> > > >> >> > > > be
> > > >> >> > > > > >> possible to introduce exceptions that inherit from
> > > >> >> > > RetriableException
> > > >> >> > > > > >> and that are thrown when those exceptions are caught
> from
> > > >> the
> > > >> >> > > brokers
> > > >> >> > > > > >> and the type of the producer is such that the
> exceptions
> > > are
> > > >> >> > > > retriable?
> > > >> >> > > > > >>
> > > >> >> > > > > >> Yea, I think in general the exception type mixing is
> > > >> difficult
> > > >> >> to
> > > >> >> > > get
> > > >> >> > > > > > them all right. I have already proposed another
> solution
> > > >> based
> > > >> >> on
> > > >> >> > my
> > > >> >> > > > > > offline discussion with some folks working on EOS
> > > >> >> > > > > > to make the handling more straightforward for end users
> > > >> without
> > > >> >> the
> > > >> >> > > > need
> > > >> >> > > > > > to distinguish exception fatality.
> > > >> >> > > > > >
> > > >> >> > > > > >> Best,
> > > >> >> > > > > >> Bruno
> > > >> >> > > > > >>
> > > >> >> > > > > >>
> > > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > >> >> > > > > >> > Thanks Boyang for the proposal! I made a pass over
> the
> > > >> list
> > > >> >> and
> > > >> >> > > here
> > > >> >> > > > > are
> > > >> >> > > > > >> > some thoughts:
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > 0) Although this is not part of the public API, I
> think
> > > we
> > > >> >> > should
> > > >> >> > > > make
> > > >> >> > > > > >> sure
> > > >> >> > > > > >> > that the suggested pattern (i.e. user can always
> call
> > > >> >> abortTxn()
> > > >> >> > > > when
> > > >> >> > > > > >> > handling non-fatal errors) are indeed supported.
> E.g. if
> > > >> the
> > > >> >> txn
> > > >> >> > > is
> > > >> >> > > > > >> already
> > > >> >> > > > > >> > aborted by the broker side, then users can still
> call
> > > >> >> abortTxn
> > > >> >> > > which
> > > >> >> > > > > >> would
> > > >> >> > > > > >> > not throw another exception but just be treated as a
> > > >> no-op.
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think this
> error
> > > >> can
> > > >> >> > also
> > > >> >> > > be
> > > >> >> > > > > >> > returned but not documented yet. This should be a
> > > >> non-fatal
> > > >> >> > error.
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is
> returned
> > > from
> > > >> >> > broker
> > > >> >> > > > when
> > > >> >> > > > > >> txn
> > > >> >> > > > > >> > state transition failed (e.g. it is trying to
> transit to
> > > >> >> > > > > complete-commit
> > > >> >> > > > > >> > while the current state is not prepare-commit). This
> > > error
> > > >> >> could
> > > >> >> > > > > >> indicates
> > > >> >> > > > > >> > a bug on the client internal code or the broker
> code,
> > > OR a
> > > >> >> user
> > > >> >> > > > error
> > > >> >> > > > > >> --- a
> > > >> >> > > > > >> > similar error is ConcurrentTransactionsException,
> i.e.
> > > if
> > > >> >> Kafka
> > > >> >> > is
> > > >> >> > > > > >> bug-free
> > > >> >> > > > > >> > these exceptions would only be returned if users
> try to
> > > do
> > > >> >> > > something
> > > >> >> > > > > >> wrong,
> > > >> >> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc.
> So
> > > I'm
> > > >> >> > > thinking
> > > >> >> > > > it
> > > >> >> > > > > >> > should be a non-fatal error instead of a fatal
> error,
> > > >> wdyt?
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> > > transactional
> > > >> >> > > sequence
> > > >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> > > >> >> > > > *OutOfSequenceException*
> > > >> >> > > > > >> that
> > > >> >> > > > > >> > is treated as non-fatal. I guess your proposal is
> that
> > > >> >> > > > > >> > OutOfOrderSequenceException would be treated either
> as
> > > >> fatal
> > > >> >> > with
> > > >> >> > > > > >> > transactional producer, or non-fatal with idempotent
> > > >> >> producer,
> > > >> >> > is
> > > >> >> > > > that
> > > >> >> > > > > >> > right? If the producer is only configured with
> > > idempotency
> > > >> >> but
> > > >> >> > not
> > > >> >> > > > > >> > transaction, then throwing a
> > > >> >> TransactionStateCorruptedException
> > > >> >> > > for
> > > >> >> > > > > >> > non-fatal errors would be confusing since users are
> not
> > > >> using
> > > >> >> > > > > >> transactions
> > > >> >> > > > > >> > at all.. So I suggest we always throw
> > > >> OutOfSequenceException
> > > >> >> > as-is
> > > >> >> > > > > (i.e.
> > > >> >> > > > > >> > not wrapped) no matter how the producer is
> configured,
> > > and
> > > >> >> let
> > > >> >> > the
> > > >> >> > > > > >> caller
> > > >> >> > > > > >> > decide how to handle it based on whether it is only
> > > >> >> idempotent
> > > >> >> > or
> > > >> >> > > > > >> > transactional itself.
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()` callback /
> > > >> future
> > > >> >> can
> > > >> >> > > also
> > > >> >> > > > > >> throw
> > > >> >> > > > > >> > txn-related exceptions, I think this KIP should also
> > > cover
> > > >> >> this
> > > >> >> > > API
> > > >> >> > > > as
> > > >> >> > > > > >> well?
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > 5) This is related to 1/2) above: sometimes those
> > > >> non-fatal
> > > >> >> > errors
> > > >> >> > > > > like
> > > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the
> > > state
> > > >> >> being
> > > >> >> > > > > >> corrupted
> > > >> >> > > > > >> > at the broker side, but maybe users are doing
> something
> > > >> >> wrong.
> > > >> >> > So
> > > >> >> > > > I'm
> > > >> >> > > > > >> > wondering if we should further distinguish those
> > > non-fatal
> > > >> >> > errors
> > > >> >> > > > > >> between
> > > >> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a
> broker
> > > >> timed
> > > >> >> > out
> > > >> >> > > > and
> > > >> >> > > > > >> > aborted a txn and later an endTxn request is
> received,
> > > >> and b)
> > > >> >> > the
> > > >> >> > > > > user's
> > > >> >> > > > > >> > API call pattern is incorrect, causing the request
> to be
> > > >> >> > rejected
> > > >> >> > > > with
> > > >> >> > > > > >> an
> > > >> >> > > > > >> > error code from the broker.
> > > >> >> *TransactionStateCorruptedException*
> > > >> >> > > > feels
> > > >> >> > > > > >> to
> > > >> >> > > > > >> > me more like for case a), but not case b).
> > > >> >> > > > > >> >
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > Guozhang
> > > >> >> > > > > >> >
> > > >> >> > > > > >> >
> > > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > >> >> > > > > reluctanthero104@gmail.com>
> > > >> >> > > > > >> > wrote:
> > > >> >> > > > > >> >
> > > >> >> > > > > >> >> Thanks Matthias, I think your proposal makes sense
> as
> > > >> well,
> > > >> >> on
> > > >> >> > > the
> > > >> >> > > > > pro
> > > >> >> > > > > >> side
> > > >> >> > > > > >> >> we could have a universally agreed exception type
> to be
> > > >> >> caught
> > > >> >> > by
> > > >> >> > > > the
> > > >> >> > > > > >> user,
> > > >> >> > > > > >> >> without having an extra layer on top of the actual
> > > >> >> exceptions.
> > > >> >> > I
> > > >> >> > > > > could
> > > >> >> > > > > >> see
> > > >> >> > > > > >> >> some issue on downsides:
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >> 1. The exception hierarchy will be more complex.
> Right
> > > >> now
> > > >> >> we
> > > >> >> > > have
> > > >> >> > > > > >> >> RetriableException type, if we are going to add a
> > > >> >> > > > > >> >> `ProducerRetriableException` type, we have to put
> this
> > > >> new
> > > >> >> > > > interface
> > > >> >> > > > > >> as the
> > > >> >> > > > > >> >> parent of the RetriableException, because not all
> > > thrown
> > > >> >> > > non-fatal
> > > >> >> > > > > >> >> exceptions are `retriable` in general for
> producer, for
> > > >> >> example
> > > >> >> > > > > >> >> <
> > > >> >> > > > > >> >>
> > > >> >> > > > > >>
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > >> >> > > > > >> >>> .
> > > >> >> > > > > >> >> We could have an empty interface
> > > >> >> `ProducerRetriableException`
> > > >> >> > to
> > > >> >> > > > let
> > > >> >> > > > > >> all
> > > >> >> > > > > >> >> the thrown exceptions implement for sure, but it's
> a
> > > bit
> > > >> >> > > unorthodox
> > > >> >> > > > > in
> > > >> >> > > > > >> the
> > > >> >> > > > > >> >> way we deal with exceptions in general.
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >> 2. There are cases where we throw a KafkaException
> > > >> wrapping
> > > >> >> > > another
> > > >> >> > > > > >> >> KafkaException as either fatal or non-fatal. If we
> use
> > > an
> > > >> >> > > interface
> > > >> >> > > > > to
> > > >> >> > > > > >> >> solve #1, it is also required to implement another
> > > >> bloated
> > > >> >> > > > exception
> > > >> >> > > > > >> class
> > > >> >> > > > > >> >> which could replace KafkaException type, as we
> couldn't
> > > >> mark
> > > >> >> > > > > >> KafkaException
> > > >> >> > > > > >> >> as retriable for sure.
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping means we
> > > could
> > > >> >> limit
> > > >> >> > > the
> > > >> >> > > > > >> scope
> > > >> >> > > > > >> >> of affection to the producer only, which is
> important
> > > >> since
> > > >> >> we
> > > >> >> > > > don't
> > > >> >> > > > > >> want
> > > >> >> > > > > >> >> shared exception types to implement a
> producer-related
> > > >> >> > interface,
> > > >> >> > > > > such
> > > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >> Best,
> > > >> >> > > > > >> >> Boyang
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> > > >> >> > mjsax@apache.org
> > > >> >> > > >
> > > >> >> > > > > >> wrote:
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>> Overall, categorizing exceptions makes a lot of
> sense.
> > > >> As I
> > > >> >> > > don't
> > > >> >> > > > > know
> > > >> >> > > > > >> >>> the producer internals well enough, I cannot
> comment
> > > on
> > > >> the
> > > >> >> > > > > >> >>> categorization in detail though.
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>> What I am wondering is, if we should introduce an
> > > >> exception
> > > >> >> > > > > interface
> > > >> >> > > > > >> >>> that non-fatal exception would implement instead
> of
> > > >> >> creating a
> > > >> >> > > new
> > > >> >> > > > > >> class
> > > >> >> > > > > >> >>> that will wrap non-fatal exceptions? What would
> be the
> > > >> >> > pros/cons
> > > >> >> > > > for
> > > >> >> > > > > >> >>> both designs?
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>> -Matthias
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > >> >> > > > > >> >>>> Hey there,
> > > >> >> > > > > >> >>>>
> > > >> >> > > > > >> >>>> I would like to start a discussion thread for
> > > KIP-691:
> > > >> >> > > > > >> >>>>
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>
> > > >> >> > > > > >>
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >>
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > >> >> > > > > >> >>>>
> > > >> >> > > > > >> >>>> The KIP is aiming to simplify the exception
> handling
> > > >> logic
> > > >> >> > for
> > > >> >> > > > > >> >>>> transactional Producer users by classifying
> fatal and
> > > >> >> > non-fatal
> > > >> >> > > > > >> >>> exceptions
> > > >> >> > > > > >> >>>> and throw them correspondingly for easier catch
> and
> > > >> retry.
> > > >> >> > Let
> > > >> >> > > me
> > > >> >> > > > > >> know
> > > >> >> > > > > >> >>> what
> > > >> >> > > > > >> >>>> you think.
> > > >> >> > > > > >> >>>>
> > > >> >> > > > > >> >>>> Best,
> > > >> >> > > > > >> >>>> Boyang
> > > >> >> > > > > >> >>>>
> > > >> >> > > > > >> >>>
> > > >> >> > > > > >> >>
> > > >> >> > > > > >> >
> > > >> >> > > > > >> >
> > > >> >> > > > > >>
> > > >> >> > > > > >
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > --
> > > >> >> > > > -- Guozhang
> > > >> >> > > >
> > > >> >> > >
> > > >> >> >
> > > >> >> >
> > > >> >> > --
> > > >> >> > -- Guozhang
> > > >> >> >
> > > >> >>
> > > >> >
> > > >> >
> > > >> > --
> > > >> > -- Guozhang
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> -- Guozhang
> > > >>
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Hiringuru <in...@hiringuru.com>.
Why  we are receiving all emails kindly remove us from dev@kafka.apache.org we don't want to receive emails anymore.

Thanks
> On 01/23/2021 4:14 AM Guozhang Wang <wa...@gmail.com> wrote:
> 
>  
> Thanks Boyang, yes I think I was confused about the different handling of
> two abortTxn calls, and now I get it was not intentional. I think I do not
> have more concerns.
> 
> On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <re...@gmail.com>
> wrote:
> 
> > Thanks for the clarification Guozhang, I got your point that we want to
> > have a consistent handling of fatal exceptions being thrown from the
> > abortTxn. I modified the current template to move the fatal exception
> > try-catch outside of the processing loop to make sure we could get a chance
> > to close consumer/producer modules. Let me know what you think.
> >
> > Best,
> > Boyang
> >
> > On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > My understanding is that abortTransaction would only throw when the
> > > producer is in fatal state. For CommitFailed, the producer should still
> > be
> > > in the abortable error state, so that abortTransaction call would not
> > throw.
> > >
> > > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > >> Or are you going to maintain some internal state such that, the
> > >> `abortTransaction` in the catch block would never throw again?
> > >>
> > >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <wa...@gmail.com>
> > >> wrote:
> > >>
> > >> > Hi Boyang/Jason,
> > >> >
> > >> > I've also thought about this (i.e. using CommitFailed for all
> > >> non-fatal),
> > >> > but what I'm pondering is that, in the catch (CommitFailed) block,
> > what
> > >> > would happen if the `producer.abortTransaction();` throws again?
> > should
> > >> > that be captured as a fatal and cause the client to close again.
> > >> >
> > >> > If yes, then naively the pattern would be:
> > >> >
> > >> > ...
> > >> > catch (CommitFailedException e) {
> > >> >         // Transaction commit failed with abortable error, user could
> > >> reset
> > >> >         // the application state and resume with a new transaction.
> > The
> > >> > root
> > >> >         // cause was wrapped in the thrown exception.
> > >> >         resetToLastCommittedPositions(consumer);
> > >> >         try {
> > >> >             producer.abortTransaction();
> > >> >         } catch (KafkaException e) {
> > >> >             producer.close();
> > >> >             consumer.close();
> > >> >             throw e;
> > >> >         }
> > >> >     } catch (KafkaException e) {
> > >> >         producer.close();
> > >> >         consumer.close();
> > >> >         throw e;
> > >> >     }
> > >> > ...
> > >> >
> > >> > Guozhang
> > >> >
> > >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> > >> reluctanthero104@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Hey Guozhang,
> > >> >>
> > >> >> Jason and I were discussing the new API offline and decided to take
> > >> >> another
> > >> >> approach. Firstly, the reason not to invent a new API with returned
> > >> >> boolean
> > >> >> flag is for compatibility consideration, since old EOS code would not
> > >> know
> > >> >> that a given transaction commit was failed internally as they don't
> > >> listen
> > >> >> to the output flag. Our proposed alternative solution is to let
> > >> >> *commitTransaction
> > >> >> throw CommitFailedException whenever the commit failed with non-fatal
> > >> >> exception*, so that on the caller side the handling logic becomes:
> > >> >>
> > >> >> try {
> > >> >>         if (shouldCommit) {
> > >> >>             producer.commitTransaction();
> > >> >>         } else {
> > >> >>             resetToLastCommittedPositions(consumer);
> > >> >>             producer.abortTransaction();
> > >> >>         }
> > >> >>     } catch (CommitFailedException e) {
> > >> >>         // Transaction commit failed with abortable error, user could
> > >> >> reset
> > >> >>         // the application state and resume with a new transaction.
> > The
> > >> >> root
> > >> >>         // cause was wrapped in the thrown exception.
> > >> >>         resetToLastCommittedPositions(consumer);
> > >> >>         producer.abortTransaction();
> > >> >>     } catch (KafkaException e) {
> > >> >>         producer.close();
> > >> >>         consumer.close();
> > >> >>         throw e;
> > >> >>     }
> > >> >>
> > >> >> This approach looks cleaner as all exception types other than
> > >> CommitFailed
> > >> >> will doom to be fatal, which is very easy to adopt for users. In the
> > >> >> meantime, we still maintain the commitTxn behavior to throw instead
> > of
> > >> >> silently failing.
> > >> >>
> > >> >> In addition, we decided to drop the recommendation to handle
> > >> >> TimeoutException and leave it to the users to make the call. The
> > >> downside
> > >> >> for blindly calling abortTxn upon timeout is that we could result in
> > an
> > >> >> illegal state when the commit was already successful on the broker
> > >> >> side. Without a good guarantee on the outcome, overcomplicating the
> > >> >> template should not be encouraged IMHO.
> > >> >>
> > >> >> Let me know your thoughts on the new approach here, thank you!
> > >> >>
> > >> >> Best,
> > >> >> Boyang
> > >> >>
> > >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com>
> > >> >> wrote:
> > >> >>
> > >> >> > Thanks for your clarification on 2)/3), that makes sense.
> > >> >> >
> > >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> > >> >> reluctanthero104@gmail.com>
> > >> >> > wrote:
> > >> >> >
> > >> >> > > Thanks for the input Guozhang, replied inline.
> > >> >> > >
> > >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> > wangguoz@gmail.com>
> > >> >> > wrote:
> > >> >> > >
> > >> >> > > > Hello Boyang,
> > >> >> > > >
> > >> >> > > > Thanks for the updated KIP. I read it again and have the
> > >> following
> > >> >> > > > thoughts:
> > >> >> > > >
> > >> >> > > > 0. I'm a bit concerned that if commitTxn does not throw any
> > >> >> non-fatal
> > >> >> > > > exception, and instead we rely on the subsequent beginTxn call
> > to
> > >> >> > throw,
> > >> >> > > it
> > >> >> > > > may violate some callers with a pattern that relying on
> > >> commitTxn to
> > >> >> > > > succeed to make some non-rollback operations. For example:
> > >> >> > > >
> > >> >> > > > beginTxn()
> > >> >> > > > // do some read-write on my local DB
> > >> >> > > > commitTxn()
> > >> >> > > > // if commitTxn succeeds, then commit the DB
> > >> >> > > >
> > >> >> > > > -------------
> > >> >> > > >
> > >> >> > > > The issue is that, committing DB is a non-rollback operation,
> > and
> > >> >> users
> > >> >> > > may
> > >> >> > > > just rely on commitTxn to return without error to make this
> > >> >> > non-rollback
> > >> >> > > > call. Of course we can just claim this pattern is not
> > legitimate
> > >> >> and is
> > >> >> > > not
> > >> >> > > > the right way of doing things, but many users may naturally
> > adopt
> > >> >> this
> > >> >> > > > pattern.
> > >> >> > > >
> > >> >> > > > So maybe we should still let commitTxn also throw non-fatal
> > >> >> exceptions,
> > >> >> > > in
> > >> >> > > > which case we would then call abortTxn again.
> > >> >> > > >
> > >> >> > > > But if we do this, the pattern becomes:
> > >> >> > > >
> > >> >> > > > try {
> > >> >> > > >    beginTxn()
> > >> >> > > >    // do something
> > >> >> > > > } catch (Exception) {
> > >> >> > > >    shouldCommit = false;
> > >> >> > > > }
> > >> >> > > >
> > >> >> > > > if (shouldCommit) {
> > >> >> > > >     try {
> > >> >> > > >         commitTxn()
> > >> >> > > >     } catch (...) {        // enumerate all fatal exceptions
> > >> >> > > >         shutdown()
> > >> >> > > >     } catch (KafkaException) {
> > >> >> > > >         // non-fatal
> > >> >> > > >         shouldCommit = false;
> > >> >> > > >     }
> > >> >> > > > }
> > >> >> > > >
> > >> >> > > > if (!shouldCommit && running()) {
> > >> >> > > > try {
> > >> >> > > >         abortTxn()
> > >> >> > > >     } catch (KafkaException) {
> > >> >> > > >         // only throw fatal
> > >> >> > > >         shutdown()
> > >> >> > > >     }
> > >> >> > > > }
> > >> >> > > >
> > >> >> > > > ---------------------
> > >> >> > > >
> > >> >> > > > Which is much more complicated.
> > >> >> > > >
> > >> >> > > > Thank makes me think, the alternative we have discussed offline
> > >> may
> > >> >> be
> > >> >> > > > better: let commitTxn() to return a boolean flag.
> > >> >> > > >
> > >> >> > > > * If it returns true, it means the commit succeeded. Users can
> > >> >> > > comfortably
> > >> >> > > > continue and do any external non-rollback operations if they
> > >> like.
> > >> >> > > > * If it returns false, it means the commit failed with
> > non-fatal
> > >> >> error
> > >> >> > > *AND
> > >> >> > > > the txn has been aborted*. Users do not need to call abortTxn
> > >> again.
> > >> >> > > > * If it throws, then it means fatal errors. Users should shut
> > >> down
> > >> >> the
> > >> >> > > > client.
> > >> >> > > >
> > >> >> > > > In this case, the pattern becomes:
> > >> >> > > >
> > >> >> > > > try {
> > >> >> > > >    beginTxn()
> > >> >> > > >    // do something
> > >> >> > > > } catch (Exception) {
> > >> >> > > >    shouldCommit = false;
> > >> >> > > > }
> > >> >> > > >
> > >> >> > > > try {
> > >> >> > > >     if (shouldCommit) {
> > >> >> > > >         commitSucceeded = commitTxn()
> > >> >> > > >     } else {
> > >> >> > > >         // reset offsets, rollback operations, etc
> > >> >> > > >         abortTxn()
> > >> >> > > >     }
> > >> >> > > > } catch (KafkaException) {
> > >> >> > > >     // only throw fatal
> > >> >> > > >     shutdown()
> > >> >> > > > }
> > >> >> > > >
> > >> >> > > > if (commitSucceeded)
> > >> >> > > >    // do other non-rollbackable things
> > >> >> > > > else
> > >> >> > > >    // reset offsets, rollback operations, etc
> > >> >> > > >
> > >> >> > > > -------------------------
> > >> >> > > >
> > >> >> > > > Of course, if we want to have better visibility into what
> > caused
> > >> the
> > >> >> > > commit
> > >> >> > > > to fail and txn to abort. We can let the return type be an enum
> > >> >> instead
> > >> >> > > of
> > >> >> > > > a flag. But my main idea is to still let the commitTxn be the
> > >> final
> > >> >> > point
> > >> >> > > > users can tell whether this txn succeeded or not, instead of
> > >> >> relying on
> > >> >> > > the
> > >> >> > > > next beginTxn() call.
> > >> >> > > >
> > >> >> > > > I agree that adding a boolean flag is indeed useful in this
> > case.
> > >> >> Will
> > >> >> > > update the KIP.
> > >> >> > >
> > >> >> > > 1. Re: "while maintaining the behavior to throw fatal exception
> > in
> > >> >> raw"
> > >> >> > not
> > >> >> > > > sure what do you mean by "throw" here. Are you proposing the
> > >> >> callback
> > >> >> > > would
> > >> >> > > > *pass* (not throw) in any fatal exceptions when triggered
> > without
> > >> >> > > wrapping?
> > >> >> > > >
> > >> >> > > > Yes, I want to say *pass*, the benefit is to make the end
> > user's
> > >> >> > > expectation consistent
> > >> >> > > regarding exception handling.
> > >> >> > >
> > >> >> > >
> > >> >> > > > 2. I'm not sure I understand the claim regarding the callback
> > >> that
> > >> >> "In
> > >> >> > > EOS
> > >> >> > > > setup, it is not required to handle the exception". Are you
> > >> >> proposing
> > >> >> > > that,
> > >> >> > > > e.g. in Streams, we do not try to handle any exceptions if EOS
> > is
> > >> >> > enabled
> > >> >> > > > in the callback anymore, but just let commitTxn() itself to
> > fail
> > >> to
> > >> >> be
> > >> >> > > > notified about the problem? Personally I think in Streams we
> > >> should
> > >> >> > just
> > >> >> > > > make the handling logic of the callback to be consistent
> > >> regardless
> > >> >> of
> > >> >> > > the
> > >> >> > > > EOS settings (today we have different logic depending on this
> > >> logic,
> > >> >> > > which
> > >> >> > > > I think could be unified as well).
> > >> >> > > >
> > >> >> > > > My idea originates from the claim on send API:
> > >> >> > > "When used as part of a transaction, it is not necessary to
> > define
> > >> a
> > >> >> > > callback or check the result of the future  in order to detect
> > >> errors
> > >> >> > from
> > >> >> > > <code>send</code>. "
> > >> >> > > My understanding is that for EOS, the exception will be detected
> > by
> > >> >> > calling
> > >> >> > > transactional APIs either way, so it is a duplicate handling to
> > >> track
> > >> >> > > all the sendExceptions in RecordCollector. However, I looked up
> > >> >> > > sendException is being used today as follow:
> > >> >> > >
> > >> >> > > 1. Pass to "ProductionExceptionHandler" for any default or
> > >> customized
> > >> >> > > exception handler to handle
> > >> >> > > 2. Stop collecting offset info or new exceptions
> > >> >> > > 3. Check and rethrow exceptions in close(), flush() or new send()
> > >> >> calls
> > >> >> > >
> > >> >> > > To my understanding, we should still honor the commitment to #1
> > for
> > >> >> any
> > >> >> > > user customized implementation. The #2 does not really affect our
> > >> >> > decision
> > >> >> > > upon EOS. The #3 here is still valuable as it could help us fail
> > >> fast
> > >> >> in
> > >> >> > > new send() instead of waiting to later stage of processing. In
> > that
> > >> >> > sense,
> > >> >> > > I agree we should continue to record send exceptions even under
> > EOS
> > >> >> case
> > >> >> > to
> > >> >> > > ensure the strength of stream side Producer logic. On the safer
> > >> side,
> > >> >> we
> > >> >> > no
> > >> >> > > longer need to wrap certain fatal exceptions like ProducerFenced
> > as
> > >> >> > > TaskMigrated, since they should not crash the stream thread if
> > >> thrown
> > >> >> in
> > >> >> > > raw format, once we adopt the new processing model in the send
> > >> phase.
> > >> >> > >
> > >> >> > >
> > >> >> > > >
> > >> >> > > > Guozhang
> > >> >> > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > >> >> > reluctanthero104@gmail.com>
> > >> >> > > > wrote:
> > >> >> > > >
> > >> >> > > > > Thanks for everyone's feedback so far. I have polished the
> > KIP
> > >> >> after
> > >> >> > > > > offline discussion with some folks working on EOS to make the
> > >> >> > exception
> > >> >> > > > > handling more lightweight. The essential change is that we
> > are
> > >> not
> > >> >> > > > > inventing a new intermediate exception type, but instead
> > >> >> separating a
> > >> >> > > > full
> > >> >> > > > > transaction into two phases:
> > >> >> > > > >
> > >> >> > > > > 1. The data transmission phase
> > >> >> > > > > 2. The commit phase
> > >> >> > > > >
> > >> >> > > > > This way for any exception thrown from phase 1, will be an
> > >> >> indicator
> > >> >> > in
> > >> >> > > > > phase 2 whether we should do commit or abort, and from now on
> > >> >> > > > > `commitTransaction` should only throw fatal exceptions,
> > >> similar to
> > >> >> > > > > `abortTransaction`, so that any KafkaException caught in the
> > >> >> commit
> > >> >> > > phase
> > >> >> > > > > will be definitely fatal to crash the app. For more advanced
> > >> users
> > >> >> > such
> > >> >> > > > as
> > >> >> > > > > Streams, we have the ability to further wrap selected types
> > of
> > >> >> fatal
> > >> >> > > > > exceptions to trigger task migration if necessary.
> > >> >> > > > >
> > >> >> > > > > More details in the KIP, feel free to take another look,
> > >> thanks!
> > >> >> > > > >
> > >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > >> >> > > reluctanthero104@gmail.com>
> > >> >> > > > > wrote:
> > >> >> > > > >
> > >> >> > > > > > Thanks Bruno for the feedback.
> > >> >> > > > > >
> > >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> > >> >> bruno@confluent.io>
> > >> >> > > > wrote:
> > >> >> > > > > >
> > >> >> > > > > >> Thanks Boyang for the KIP!
> > >> >> > > > > >>
> > >> >> > > > > >> Like Matthias, I do also not know the producer internal
> > well
> > >> >> > enough
> > >> >> > > to
> > >> >> > > > > >> comment on the categorization. However, I think having a
> > >> super
> > >> >> > > > exception
> > >> >> > > > > >> (e.g. RetriableException) that  encodes if an exception is
> > >> >> fatal
> > >> >> > or
> > >> >> > > > not
> > >> >> > > > > >> is cleaner, better understandable and less error-prone,
> > >> because
> > >> >> > > > ideally
> > >> >> > > > > >> when you add a new non-fatal exception in future you just
> > >> need
> > >> >> to
> > >> >> > > > think
> > >> >> > > > > >> about letting it inherit from the super exception and all
> > >> the
> > >> >> rest
> > >> >> > > of
> > >> >> > > > > >> the code will just behave correctly without the need to
> > wrap
> > >> >> the
> > >> >> > new
> > >> >> > > > > >> exception into another exception each time it is thrown
> > >> (maybe
> > >> >> it
> > >> >> > is
> > >> >> > > > > >> thrown at different location in the code).
> > >> >> > > > > >>
> > >> >> > > > > >> As far as I understand the following statement from your
> > >> >> previous
> > >> >> > > > e-mail
> > >> >> > > > > >> is the reason that currently such a super exception is not
> > >> >> > possible:
> > >> >> > > > > >>
> > >> >> > > > > >> "Right now we have RetriableException type, if we are
> > going
> > >> to
> > >> >> > add a
> > >> >> > > > > >> `ProducerRetriableException` type, we have to put this new
> > >> >> > interface
> > >> >> > > > as
> > >> >> > > > > >> the parent of the RetriableException, because not all
> > thrown
> > >> >> > > non-fatal
> > >> >> > > > > >> exceptions are `retriable` in general for producer"
> > >> >> > > > > >>
> > >> >> > > > > >>
> > >> >> > > > > >> In the list of exceptions in your KIP, I found non-fatal
> > >> >> > exceptions
> > >> >> > > > that
> > >> >> > > > > >> do not inherit from RetriableException. I guess those are
> > >> the
> > >> >> ones
> > >> >> > > you
> > >> >> > > > > >> are referring to in your statement:
> > >> >> > > > > >>
> > >> >> > > > > >> InvalidProducerEpochException
> > >> >> > > > > >> InvalidPidMappingException
> > >> >> > > > > >> TransactionAbortedException
> > >> >> > > > > >>
> > >> >> > > > > >> All of those exceptions are non-fatal and do not inherit
> > >> from
> > >> >> > > > > >> RetriableException. Is there a reason for that? If they
> > >> >> depended
> > >> >> > > from
> > >> >> > > > > >> RetriableException we would be a bit closer to a super
> > >> >> exception I
> > >> >> > > > > >> mention above.
> > >> >> > > > > >>
> > >> >> > > > > >> The reason is that sender may catch those exceptions in
> > the
> > >> >> > > > > > ProduceResponse, and it currently does infinite
> > >> >> > > > > > retries on RetriableException. To make sure we could
> > trigger
> > >> the
> > >> >> > > > > > abortTransaction() by catching non-fatal thrown
> > >> >> > > > > > exceptions and reinitialize the txn state, we chose not to
> > >> let
> > >> >> > those
> > >> >> > > > > > exceptions inherit RetriableException so that
> > >> >> > > > > > they won't cause infinite retry on sender.
> > >> >> > > > > >
> > >> >> > > > > >
> > >> >> > > > > >> With OutOfOrderSequenceException and
> > >> >> UnknownProducerIdException, I
> > >> >> > > > think
> > >> >> > > > > >> to understand that their fatality depends on the type
> > (i.e.
> > >> >> > > > > >> configuration) of the producer. That makes it difficult to
> > >> >> have a
> > >> >> > > > super
> > >> >> > > > > >> exception that encodes the retriability as mentioned
> > above.
> > >> >> Would
> > >> >> > it
> > >> >> > > > be
> > >> >> > > > > >> possible to introduce exceptions that inherit from
> > >> >> > > RetriableException
> > >> >> > > > > >> and that are thrown when those exceptions are caught from
> > >> the
> > >> >> > > brokers
> > >> >> > > > > >> and the type of the producer is such that the exceptions
> > are
> > >> >> > > > retriable?
> > >> >> > > > > >>
> > >> >> > > > > >> Yea, I think in general the exception type mixing is
> > >> difficult
> > >> >> to
> > >> >> > > get
> > >> >> > > > > > them all right. I have already proposed another solution
> > >> based
> > >> >> on
> > >> >> > my
> > >> >> > > > > > offline discussion with some folks working on EOS
> > >> >> > > > > > to make the handling more straightforward for end users
> > >> without
> > >> >> the
> > >> >> > > > need
> > >> >> > > > > > to distinguish exception fatality.
> > >> >> > > > > >
> > >> >> > > > > >> Best,
> > >> >> > > > > >> Bruno
> > >> >> > > > > >>
> > >> >> > > > > >>
> > >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > >> >> > > > > >> > Thanks Boyang for the proposal! I made a pass over the
> > >> list
> > >> >> and
> > >> >> > > here
> > >> >> > > > > are
> > >> >> > > > > >> > some thoughts:
> > >> >> > > > > >> >
> > >> >> > > > > >> > 0) Although this is not part of the public API, I think
> > we
> > >> >> > should
> > >> >> > > > make
> > >> >> > > > > >> sure
> > >> >> > > > > >> > that the suggested pattern (i.e. user can always call
> > >> >> abortTxn()
> > >> >> > > > when
> > >> >> > > > > >> > handling non-fatal errors) are indeed supported. E.g. if
> > >> the
> > >> >> txn
> > >> >> > > is
> > >> >> > > > > >> already
> > >> >> > > > > >> > aborted by the broker side, then users can still call
> > >> >> abortTxn
> > >> >> > > which
> > >> >> > > > > >> would
> > >> >> > > > > >> > not throw another exception but just be treated as a
> > >> no-op.
> > >> >> > > > > >> >
> > >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think this error
> > >> can
> > >> >> > also
> > >> >> > > be
> > >> >> > > > > >> > returned but not documented yet. This should be a
> > >> non-fatal
> > >> >> > error.
> > >> >> > > > > >> >
> > >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is returned
> > from
> > >> >> > broker
> > >> >> > > > when
> > >> >> > > > > >> txn
> > >> >> > > > > >> > state transition failed (e.g. it is trying to transit to
> > >> >> > > > > complete-commit
> > >> >> > > > > >> > while the current state is not prepare-commit). This
> > error
> > >> >> could
> > >> >> > > > > >> indicates
> > >> >> > > > > >> > a bug on the client internal code or the broker code,
> > OR a
> > >> >> user
> > >> >> > > > error
> > >> >> > > > > >> --- a
> > >> >> > > > > >> > similar error is ConcurrentTransactionsException, i.e.
> > if
> > >> >> Kafka
> > >> >> > is
> > >> >> > > > > >> bug-free
> > >> >> > > > > >> > these exceptions would only be returned if users try to
> > do
> > >> >> > > something
> > >> >> > > > > >> wrong,
> > >> >> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So
> > I'm
> > >> >> > > thinking
> > >> >> > > > it
> > >> >> > > > > >> > should be a non-fatal error instead of a fatal error,
> > >> wdyt?
> > >> >> > > > > >> >
> > >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> > transactional
> > >> >> > > sequence
> > >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> > >> >> > > > *OutOfSequenceException*
> > >> >> > > > > >> that
> > >> >> > > > > >> > is treated as non-fatal. I guess your proposal is that
> > >> >> > > > > >> > OutOfOrderSequenceException would be treated either as
> > >> fatal
> > >> >> > with
> > >> >> > > > > >> > transactional producer, or non-fatal with idempotent
> > >> >> producer,
> > >> >> > is
> > >> >> > > > that
> > >> >> > > > > >> > right? If the producer is only configured with
> > idempotency
> > >> >> but
> > >> >> > not
> > >> >> > > > > >> > transaction, then throwing a
> > >> >> TransactionStateCorruptedException
> > >> >> > > for
> > >> >> > > > > >> > non-fatal errors would be confusing since users are not
> > >> using
> > >> >> > > > > >> transactions
> > >> >> > > > > >> > at all.. So I suggest we always throw
> > >> OutOfSequenceException
> > >> >> > as-is
> > >> >> > > > > (i.e.
> > >> >> > > > > >> > not wrapped) no matter how the producer is configured,
> > and
> > >> >> let
> > >> >> > the
> > >> >> > > > > >> caller
> > >> >> > > > > >> > decide how to handle it based on whether it is only
> > >> >> idempotent
> > >> >> > or
> > >> >> > > > > >> > transactional itself.
> > >> >> > > > > >> >
> > >> >> > > > > >> > 4) Besides all the txn APIs, the `send()` callback /
> > >> future
> > >> >> can
> > >> >> > > also
> > >> >> > > > > >> throw
> > >> >> > > > > >> > txn-related exceptions, I think this KIP should also
> > cover
> > >> >> this
> > >> >> > > API
> > >> >> > > > as
> > >> >> > > > > >> well?
> > >> >> > > > > >> >
> > >> >> > > > > >> > 5) This is related to 1/2) above: sometimes those
> > >> non-fatal
> > >> >> > errors
> > >> >> > > > > like
> > >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the
> > state
> > >> >> being
> > >> >> > > > > >> corrupted
> > >> >> > > > > >> > at the broker side, but maybe users are doing something
> > >> >> wrong.
> > >> >> > So
> > >> >> > > > I'm
> > >> >> > > > > >> > wondering if we should further distinguish those
> > non-fatal
> > >> >> > errors
> > >> >> > > > > >> between
> > >> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a broker
> > >> timed
> > >> >> > out
> > >> >> > > > and
> > >> >> > > > > >> > aborted a txn and later an endTxn request is received,
> > >> and b)
> > >> >> > the
> > >> >> > > > > user's
> > >> >> > > > > >> > API call pattern is incorrect, causing the request to be
> > >> >> > rejected
> > >> >> > > > with
> > >> >> > > > > >> an
> > >> >> > > > > >> > error code from the broker.
> > >> >> *TransactionStateCorruptedException*
> > >> >> > > > feels
> > >> >> > > > > >> to
> > >> >> > > > > >> > me more like for case a), but not case b).
> > >> >> > > > > >> >
> > >> >> > > > > >> >
> > >> >> > > > > >> > Guozhang
> > >> >> > > > > >> >
> > >> >> > > > > >> >
> > >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > >> >> > > > > reluctanthero104@gmail.com>
> > >> >> > > > > >> > wrote:
> > >> >> > > > > >> >
> > >> >> > > > > >> >> Thanks Matthias, I think your proposal makes sense as
> > >> well,
> > >> >> on
> > >> >> > > the
> > >> >> > > > > pro
> > >> >> > > > > >> side
> > >> >> > > > > >> >> we could have a universally agreed exception type to be
> > >> >> caught
> > >> >> > by
> > >> >> > > > the
> > >> >> > > > > >> user,
> > >> >> > > > > >> >> without having an extra layer on top of the actual
> > >> >> exceptions.
> > >> >> > I
> > >> >> > > > > could
> > >> >> > > > > >> see
> > >> >> > > > > >> >> some issue on downsides:
> > >> >> > > > > >> >>
> > >> >> > > > > >> >> 1. The exception hierarchy will be more complex. Right
> > >> now
> > >> >> we
> > >> >> > > have
> > >> >> > > > > >> >> RetriableException type, if we are going to add a
> > >> >> > > > > >> >> `ProducerRetriableException` type, we have to put this
> > >> new
> > >> >> > > > interface
> > >> >> > > > > >> as the
> > >> >> > > > > >> >> parent of the RetriableException, because not all
> > thrown
> > >> >> > > non-fatal
> > >> >> > > > > >> >> exceptions are `retriable` in general for producer, for
> > >> >> example
> > >> >> > > > > >> >> <
> > >> >> > > > > >> >>
> > >> >> > > > > >>
> > >> >> > > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> > https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > >> >> > > > > >> >>> .
> > >> >> > > > > >> >> We could have an empty interface
> > >> >> `ProducerRetriableException`
> > >> >> > to
> > >> >> > > > let
> > >> >> > > > > >> all
> > >> >> > > > > >> >> the thrown exceptions implement for sure, but it's a
> > bit
> > >> >> > > unorthodox
> > >> >> > > > > in
> > >> >> > > > > >> the
> > >> >> > > > > >> >> way we deal with exceptions in general.
> > >> >> > > > > >> >>
> > >> >> > > > > >> >> 2. There are cases where we throw a KafkaException
> > >> wrapping
> > >> >> > > another
> > >> >> > > > > >> >> KafkaException as either fatal or non-fatal. If we use
> > an
> > >> >> > > interface
> > >> >> > > > > to
> > >> >> > > > > >> >> solve #1, it is also required to implement another
> > >> bloated
> > >> >> > > > exception
> > >> >> > > > > >> class
> > >> >> > > > > >> >> which could replace KafkaException type, as we couldn't
> > >> mark
> > >> >> > > > > >> KafkaException
> > >> >> > > > > >> >> as retriable for sure.
> > >> >> > > > > >> >>
> > >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping means we
> > could
> > >> >> limit
> > >> >> > > the
> > >> >> > > > > >> scope
> > >> >> > > > > >> >> of affection to the producer only, which is important
> > >> since
> > >> >> we
> > >> >> > > > don't
> > >> >> > > > > >> want
> > >> >> > > > > >> >> shared exception types to implement a producer-related
> > >> >> > interface,
> > >> >> > > > > such
> > >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> > >> >> > > > > >> >>
> > >> >> > > > > >> >> Best,
> > >> >> > > > > >> >> Boyang
> > >> >> > > > > >> >>
> > >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> > >> >> > mjsax@apache.org
> > >> >> > > >
> > >> >> > > > > >> wrote:
> > >> >> > > > > >> >>
> > >> >> > > > > >> >>> Thanks for the KIP Boyang!
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>> Overall, categorizing exceptions makes a lot of sense.
> > >> As I
> > >> >> > > don't
> > >> >> > > > > know
> > >> >> > > > > >> >>> the producer internals well enough, I cannot comment
> > on
> > >> the
> > >> >> > > > > >> >>> categorization in detail though.
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>> What I am wondering is, if we should introduce an
> > >> exception
> > >> >> > > > > interface
> > >> >> > > > > >> >>> that non-fatal exception would implement instead of
> > >> >> creating a
> > >> >> > > new
> > >> >> > > > > >> class
> > >> >> > > > > >> >>> that will wrap non-fatal exceptions? What would be the
> > >> >> > pros/cons
> > >> >> > > > for
> > >> >> > > > > >> >>> both designs?
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>> -Matthias
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > >> >> > > > > >> >>>> Hey there,
> > >> >> > > > > >> >>>>
> > >> >> > > > > >> >>>> I would like to start a discussion thread for
> > KIP-691:
> > >> >> > > > > >> >>>>
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>
> > >> >> > > > > >>
> > >> >> > > > >
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >>
> > >>
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > >> >> > > > > >> >>>>
> > >> >> > > > > >> >>>> The KIP is aiming to simplify the exception handling
> > >> logic
> > >> >> > for
> > >> >> > > > > >> >>>> transactional Producer users by classifying fatal and
> > >> >> > non-fatal
> > >> >> > > > > >> >>> exceptions
> > >> >> > > > > >> >>>> and throw them correspondingly for easier catch and
> > >> retry.
> > >> >> > Let
> > >> >> > > me
> > >> >> > > > > >> know
> > >> >> > > > > >> >>> what
> > >> >> > > > > >> >>>> you think.
> > >> >> > > > > >> >>>>
> > >> >> > > > > >> >>>> Best,
> > >> >> > > > > >> >>>> Boyang
> > >> >> > > > > >> >>>>
> > >> >> > > > > >> >>>
> > >> >> > > > > >> >>
> > >> >> > > > > >> >
> > >> >> > > > > >> >
> > >> >> > > > > >>
> > >> >> > > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > --
> > >> >> > > > -- Guozhang
> > >> >> > > >
> > >> >> > >
> > >> >> >
> > >> >> >
> > >> >> > --
> > >> >> > -- Guozhang
> > >> >> >
> > >> >>
> > >> >
> > >> >
> > >> > --
> > >> > -- Guozhang
> > >> >
> > >>
> > >>
> > >> --
> > >> -- Guozhang
> > >>
> > >
> >
> 
> 
> -- 
> -- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Boyang, yes I think I was confused about the different handling of
two abortTxn calls, and now I get it was not intentional. I think I do not
have more concerns.

On Fri, Jan 22, 2021 at 1:12 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks for the clarification Guozhang, I got your point that we want to
> have a consistent handling of fatal exceptions being thrown from the
> abortTxn. I modified the current template to move the fatal exception
> try-catch outside of the processing loop to make sure we could get a chance
> to close consumer/producer modules. Let me know what you think.
>
> Best,
> Boyang
>
> On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <re...@gmail.com>
> wrote:
>
> > My understanding is that abortTransaction would only throw when the
> > producer is in fatal state. For CommitFailed, the producer should still
> be
> > in the abortable error state, so that abortTransaction call would not
> throw.
> >
> > On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> >> Or are you going to maintain some internal state such that, the
> >> `abortTransaction` in the catch block would never throw again?
> >>
> >> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <wa...@gmail.com>
> >> wrote:
> >>
> >> > Hi Boyang/Jason,
> >> >
> >> > I've also thought about this (i.e. using CommitFailed for all
> >> non-fatal),
> >> > but what I'm pondering is that, in the catch (CommitFailed) block,
> what
> >> > would happen if the `producer.abortTransaction();` throws again?
> should
> >> > that be captured as a fatal and cause the client to close again.
> >> >
> >> > If yes, then naively the pattern would be:
> >> >
> >> > ...
> >> > catch (CommitFailedException e) {
> >> >         // Transaction commit failed with abortable error, user could
> >> reset
> >> >         // the application state and resume with a new transaction.
> The
> >> > root
> >> >         // cause was wrapped in the thrown exception.
> >> >         resetToLastCommittedPositions(consumer);
> >> >         try {
> >> >             producer.abortTransaction();
> >> >         } catch (KafkaException e) {
> >> >             producer.close();
> >> >             consumer.close();
> >> >             throw e;
> >> >         }
> >> >     } catch (KafkaException e) {
> >> >         producer.close();
> >> >         consumer.close();
> >> >         throw e;
> >> >     }
> >> > ...
> >> >
> >> > Guozhang
> >> >
> >> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
> >> reluctanthero104@gmail.com>
> >> > wrote:
> >> >
> >> >> Hey Guozhang,
> >> >>
> >> >> Jason and I were discussing the new API offline and decided to take
> >> >> another
> >> >> approach. Firstly, the reason not to invent a new API with returned
> >> >> boolean
> >> >> flag is for compatibility consideration, since old EOS code would not
> >> know
> >> >> that a given transaction commit was failed internally as they don't
> >> listen
> >> >> to the output flag. Our proposed alternative solution is to let
> >> >> *commitTransaction
> >> >> throw CommitFailedException whenever the commit failed with non-fatal
> >> >> exception*, so that on the caller side the handling logic becomes:
> >> >>
> >> >> try {
> >> >>         if (shouldCommit) {
> >> >>             producer.commitTransaction();
> >> >>         } else {
> >> >>             resetToLastCommittedPositions(consumer);
> >> >>             producer.abortTransaction();
> >> >>         }
> >> >>     } catch (CommitFailedException e) {
> >> >>         // Transaction commit failed with abortable error, user could
> >> >> reset
> >> >>         // the application state and resume with a new transaction.
> The
> >> >> root
> >> >>         // cause was wrapped in the thrown exception.
> >> >>         resetToLastCommittedPositions(consumer);
> >> >>         producer.abortTransaction();
> >> >>     } catch (KafkaException e) {
> >> >>         producer.close();
> >> >>         consumer.close();
> >> >>         throw e;
> >> >>     }
> >> >>
> >> >> This approach looks cleaner as all exception types other than
> >> CommitFailed
> >> >> will doom to be fatal, which is very easy to adopt for users. In the
> >> >> meantime, we still maintain the commitTxn behavior to throw instead
> of
> >> >> silently failing.
> >> >>
> >> >> In addition, we decided to drop the recommendation to handle
> >> >> TimeoutException and leave it to the users to make the call. The
> >> downside
> >> >> for blindly calling abortTxn upon timeout is that we could result in
> an
> >> >> illegal state when the commit was already successful on the broker
> >> >> side. Without a good guarantee on the outcome, overcomplicating the
> >> >> template should not be encouraged IMHO.
> >> >>
> >> >> Let me know your thoughts on the new approach here, thank you!
> >> >>
> >> >> Best,
> >> >> Boyang
> >> >>
> >> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com>
> >> >> wrote:
> >> >>
> >> >> > Thanks for your clarification on 2)/3), that makes sense.
> >> >> >
> >> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> >> >> reluctanthero104@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> > > Thanks for the input Guozhang, replied inline.
> >> >> > >
> >> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <
> wangguoz@gmail.com>
> >> >> > wrote:
> >> >> > >
> >> >> > > > Hello Boyang,
> >> >> > > >
> >> >> > > > Thanks for the updated KIP. I read it again and have the
> >> following
> >> >> > > > thoughts:
> >> >> > > >
> >> >> > > > 0. I'm a bit concerned that if commitTxn does not throw any
> >> >> non-fatal
> >> >> > > > exception, and instead we rely on the subsequent beginTxn call
> to
> >> >> > throw,
> >> >> > > it
> >> >> > > > may violate some callers with a pattern that relying on
> >> commitTxn to
> >> >> > > > succeed to make some non-rollback operations. For example:
> >> >> > > >
> >> >> > > > beginTxn()
> >> >> > > > // do some read-write on my local DB
> >> >> > > > commitTxn()
> >> >> > > > // if commitTxn succeeds, then commit the DB
> >> >> > > >
> >> >> > > > -------------
> >> >> > > >
> >> >> > > > The issue is that, committing DB is a non-rollback operation,
> and
> >> >> users
> >> >> > > may
> >> >> > > > just rely on commitTxn to return without error to make this
> >> >> > non-rollback
> >> >> > > > call. Of course we can just claim this pattern is not
> legitimate
> >> >> and is
> >> >> > > not
> >> >> > > > the right way of doing things, but many users may naturally
> adopt
> >> >> this
> >> >> > > > pattern.
> >> >> > > >
> >> >> > > > So maybe we should still let commitTxn also throw non-fatal
> >> >> exceptions,
> >> >> > > in
> >> >> > > > which case we would then call abortTxn again.
> >> >> > > >
> >> >> > > > But if we do this, the pattern becomes:
> >> >> > > >
> >> >> > > > try {
> >> >> > > >    beginTxn()
> >> >> > > >    // do something
> >> >> > > > } catch (Exception) {
> >> >> > > >    shouldCommit = false;
> >> >> > > > }
> >> >> > > >
> >> >> > > > if (shouldCommit) {
> >> >> > > >     try {
> >> >> > > >         commitTxn()
> >> >> > > >     } catch (...) {        // enumerate all fatal exceptions
> >> >> > > >         shutdown()
> >> >> > > >     } catch (KafkaException) {
> >> >> > > >         // non-fatal
> >> >> > > >         shouldCommit = false;
> >> >> > > >     }
> >> >> > > > }
> >> >> > > >
> >> >> > > > if (!shouldCommit && running()) {
> >> >> > > > try {
> >> >> > > >         abortTxn()
> >> >> > > >     } catch (KafkaException) {
> >> >> > > >         // only throw fatal
> >> >> > > >         shutdown()
> >> >> > > >     }
> >> >> > > > }
> >> >> > > >
> >> >> > > > ---------------------
> >> >> > > >
> >> >> > > > Which is much more complicated.
> >> >> > > >
> >> >> > > > Thank makes me think, the alternative we have discussed offline
> >> may
> >> >> be
> >> >> > > > better: let commitTxn() to return a boolean flag.
> >> >> > > >
> >> >> > > > * If it returns true, it means the commit succeeded. Users can
> >> >> > > comfortably
> >> >> > > > continue and do any external non-rollback operations if they
> >> like.
> >> >> > > > * If it returns false, it means the commit failed with
> non-fatal
> >> >> error
> >> >> > > *AND
> >> >> > > > the txn has been aborted*. Users do not need to call abortTxn
> >> again.
> >> >> > > > * If it throws, then it means fatal errors. Users should shut
> >> down
> >> >> the
> >> >> > > > client.
> >> >> > > >
> >> >> > > > In this case, the pattern becomes:
> >> >> > > >
> >> >> > > > try {
> >> >> > > >    beginTxn()
> >> >> > > >    // do something
> >> >> > > > } catch (Exception) {
> >> >> > > >    shouldCommit = false;
> >> >> > > > }
> >> >> > > >
> >> >> > > > try {
> >> >> > > >     if (shouldCommit) {
> >> >> > > >         commitSucceeded = commitTxn()
> >> >> > > >     } else {
> >> >> > > >         // reset offsets, rollback operations, etc
> >> >> > > >         abortTxn()
> >> >> > > >     }
> >> >> > > > } catch (KafkaException) {
> >> >> > > >     // only throw fatal
> >> >> > > >     shutdown()
> >> >> > > > }
> >> >> > > >
> >> >> > > > if (commitSucceeded)
> >> >> > > >    // do other non-rollbackable things
> >> >> > > > else
> >> >> > > >    // reset offsets, rollback operations, etc
> >> >> > > >
> >> >> > > > -------------------------
> >> >> > > >
> >> >> > > > Of course, if we want to have better visibility into what
> caused
> >> the
> >> >> > > commit
> >> >> > > > to fail and txn to abort. We can let the return type be an enum
> >> >> instead
> >> >> > > of
> >> >> > > > a flag. But my main idea is to still let the commitTxn be the
> >> final
> >> >> > point
> >> >> > > > users can tell whether this txn succeeded or not, instead of
> >> >> relying on
> >> >> > > the
> >> >> > > > next beginTxn() call.
> >> >> > > >
> >> >> > > > I agree that adding a boolean flag is indeed useful in this
> case.
> >> >> Will
> >> >> > > update the KIP.
> >> >> > >
> >> >> > > 1. Re: "while maintaining the behavior to throw fatal exception
> in
> >> >> raw"
> >> >> > not
> >> >> > > > sure what do you mean by "throw" here. Are you proposing the
> >> >> callback
> >> >> > > would
> >> >> > > > *pass* (not throw) in any fatal exceptions when triggered
> without
> >> >> > > wrapping?
> >> >> > > >
> >> >> > > > Yes, I want to say *pass*, the benefit is to make the end
> user's
> >> >> > > expectation consistent
> >> >> > > regarding exception handling.
> >> >> > >
> >> >> > >
> >> >> > > > 2. I'm not sure I understand the claim regarding the callback
> >> that
> >> >> "In
> >> >> > > EOS
> >> >> > > > setup, it is not required to handle the exception". Are you
> >> >> proposing
> >> >> > > that,
> >> >> > > > e.g. in Streams, we do not try to handle any exceptions if EOS
> is
> >> >> > enabled
> >> >> > > > in the callback anymore, but just let commitTxn() itself to
> fail
> >> to
> >> >> be
> >> >> > > > notified about the problem? Personally I think in Streams we
> >> should
> >> >> > just
> >> >> > > > make the handling logic of the callback to be consistent
> >> regardless
> >> >> of
> >> >> > > the
> >> >> > > > EOS settings (today we have different logic depending on this
> >> logic,
> >> >> > > which
> >> >> > > > I think could be unified as well).
> >> >> > > >
> >> >> > > > My idea originates from the claim on send API:
> >> >> > > "When used as part of a transaction, it is not necessary to
> define
> >> a
> >> >> > > callback or check the result of the future  in order to detect
> >> errors
> >> >> > from
> >> >> > > <code>send</code>. "
> >> >> > > My understanding is that for EOS, the exception will be detected
> by
> >> >> > calling
> >> >> > > transactional APIs either way, so it is a duplicate handling to
> >> track
> >> >> > > all the sendExceptions in RecordCollector. However, I looked up
> >> >> > > sendException is being used today as follow:
> >> >> > >
> >> >> > > 1. Pass to "ProductionExceptionHandler" for any default or
> >> customized
> >> >> > > exception handler to handle
> >> >> > > 2. Stop collecting offset info or new exceptions
> >> >> > > 3. Check and rethrow exceptions in close(), flush() or new send()
> >> >> calls
> >> >> > >
> >> >> > > To my understanding, we should still honor the commitment to #1
> for
> >> >> any
> >> >> > > user customized implementation. The #2 does not really affect our
> >> >> > decision
> >> >> > > upon EOS. The #3 here is still valuable as it could help us fail
> >> fast
> >> >> in
> >> >> > > new send() instead of waiting to later stage of processing. In
> that
> >> >> > sense,
> >> >> > > I agree we should continue to record send exceptions even under
> EOS
> >> >> case
> >> >> > to
> >> >> > > ensure the strength of stream side Producer logic. On the safer
> >> side,
> >> >> we
> >> >> > no
> >> >> > > longer need to wrap certain fatal exceptions like ProducerFenced
> as
> >> >> > > TaskMigrated, since they should not crash the stream thread if
> >> thrown
> >> >> in
> >> >> > > raw format, once we adopt the new processing model in the send
> >> phase.
> >> >> > >
> >> >> > >
> >> >> > > >
> >> >> > > > Guozhang
> >> >> > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> >> >> > reluctanthero104@gmail.com>
> >> >> > > > wrote:
> >> >> > > >
> >> >> > > > > Thanks for everyone's feedback so far. I have polished the
> KIP
> >> >> after
> >> >> > > > > offline discussion with some folks working on EOS to make the
> >> >> > exception
> >> >> > > > > handling more lightweight. The essential change is that we
> are
> >> not
> >> >> > > > > inventing a new intermediate exception type, but instead
> >> >> separating a
> >> >> > > > full
> >> >> > > > > transaction into two phases:
> >> >> > > > >
> >> >> > > > > 1. The data transmission phase
> >> >> > > > > 2. The commit phase
> >> >> > > > >
> >> >> > > > > This way for any exception thrown from phase 1, will be an
> >> >> indicator
> >> >> > in
> >> >> > > > > phase 2 whether we should do commit or abort, and from now on
> >> >> > > > > `commitTransaction` should only throw fatal exceptions,
> >> similar to
> >> >> > > > > `abortTransaction`, so that any KafkaException caught in the
> >> >> commit
> >> >> > > phase
> >> >> > > > > will be definitely fatal to crash the app. For more advanced
> >> users
> >> >> > such
> >> >> > > > as
> >> >> > > > > Streams, we have the ability to further wrap selected types
> of
> >> >> fatal
> >> >> > > > > exceptions to trigger task migration if necessary.
> >> >> > > > >
> >> >> > > > > More details in the KIP, feel free to take another look,
> >> thanks!
> >> >> > > > >
> >> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> >> >> > > reluctanthero104@gmail.com>
> >> >> > > > > wrote:
> >> >> > > > >
> >> >> > > > > > Thanks Bruno for the feedback.
> >> >> > > > > >
> >> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> >> >> bruno@confluent.io>
> >> >> > > > wrote:
> >> >> > > > > >
> >> >> > > > > >> Thanks Boyang for the KIP!
> >> >> > > > > >>
> >> >> > > > > >> Like Matthias, I do also not know the producer internal
> well
> >> >> > enough
> >> >> > > to
> >> >> > > > > >> comment on the categorization. However, I think having a
> >> super
> >> >> > > > exception
> >> >> > > > > >> (e.g. RetriableException) that  encodes if an exception is
> >> >> fatal
> >> >> > or
> >> >> > > > not
> >> >> > > > > >> is cleaner, better understandable and less error-prone,
> >> because
> >> >> > > > ideally
> >> >> > > > > >> when you add a new non-fatal exception in future you just
> >> need
> >> >> to
> >> >> > > > think
> >> >> > > > > >> about letting it inherit from the super exception and all
> >> the
> >> >> rest
> >> >> > > of
> >> >> > > > > >> the code will just behave correctly without the need to
> wrap
> >> >> the
> >> >> > new
> >> >> > > > > >> exception into another exception each time it is thrown
> >> (maybe
> >> >> it
> >> >> > is
> >> >> > > > > >> thrown at different location in the code).
> >> >> > > > > >>
> >> >> > > > > >> As far as I understand the following statement from your
> >> >> previous
> >> >> > > > e-mail
> >> >> > > > > >> is the reason that currently such a super exception is not
> >> >> > possible:
> >> >> > > > > >>
> >> >> > > > > >> "Right now we have RetriableException type, if we are
> going
> >> to
> >> >> > add a
> >> >> > > > > >> `ProducerRetriableException` type, we have to put this new
> >> >> > interface
> >> >> > > > as
> >> >> > > > > >> the parent of the RetriableException, because not all
> thrown
> >> >> > > non-fatal
> >> >> > > > > >> exceptions are `retriable` in general for producer"
> >> >> > > > > >>
> >> >> > > > > >>
> >> >> > > > > >> In the list of exceptions in your KIP, I found non-fatal
> >> >> > exceptions
> >> >> > > > that
> >> >> > > > > >> do not inherit from RetriableException. I guess those are
> >> the
> >> >> ones
> >> >> > > you
> >> >> > > > > >> are referring to in your statement:
> >> >> > > > > >>
> >> >> > > > > >> InvalidProducerEpochException
> >> >> > > > > >> InvalidPidMappingException
> >> >> > > > > >> TransactionAbortedException
> >> >> > > > > >>
> >> >> > > > > >> All of those exceptions are non-fatal and do not inherit
> >> from
> >> >> > > > > >> RetriableException. Is there a reason for that? If they
> >> >> depended
> >> >> > > from
> >> >> > > > > >> RetriableException we would be a bit closer to a super
> >> >> exception I
> >> >> > > > > >> mention above.
> >> >> > > > > >>
> >> >> > > > > >> The reason is that sender may catch those exceptions in
> the
> >> >> > > > > > ProduceResponse, and it currently does infinite
> >> >> > > > > > retries on RetriableException. To make sure we could
> trigger
> >> the
> >> >> > > > > > abortTransaction() by catching non-fatal thrown
> >> >> > > > > > exceptions and reinitialize the txn state, we chose not to
> >> let
> >> >> > those
> >> >> > > > > > exceptions inherit RetriableException so that
> >> >> > > > > > they won't cause infinite retry on sender.
> >> >> > > > > >
> >> >> > > > > >
> >> >> > > > > >> With OutOfOrderSequenceException and
> >> >> UnknownProducerIdException, I
> >> >> > > > think
> >> >> > > > > >> to understand that their fatality depends on the type
> (i.e.
> >> >> > > > > >> configuration) of the producer. That makes it difficult to
> >> >> have a
> >> >> > > > super
> >> >> > > > > >> exception that encodes the retriability as mentioned
> above.
> >> >> Would
> >> >> > it
> >> >> > > > be
> >> >> > > > > >> possible to introduce exceptions that inherit from
> >> >> > > RetriableException
> >> >> > > > > >> and that are thrown when those exceptions are caught from
> >> the
> >> >> > > brokers
> >> >> > > > > >> and the type of the producer is such that the exceptions
> are
> >> >> > > > retriable?
> >> >> > > > > >>
> >> >> > > > > >> Yea, I think in general the exception type mixing is
> >> difficult
> >> >> to
> >> >> > > get
> >> >> > > > > > them all right. I have already proposed another solution
> >> based
> >> >> on
> >> >> > my
> >> >> > > > > > offline discussion with some folks working on EOS
> >> >> > > > > > to make the handling more straightforward for end users
> >> without
> >> >> the
> >> >> > > > need
> >> >> > > > > > to distinguish exception fatality.
> >> >> > > > > >
> >> >> > > > > >> Best,
> >> >> > > > > >> Bruno
> >> >> > > > > >>
> >> >> > > > > >>
> >> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> >> >> > > > > >> > Thanks Boyang for the proposal! I made a pass over the
> >> list
> >> >> and
> >> >> > > here
> >> >> > > > > are
> >> >> > > > > >> > some thoughts:
> >> >> > > > > >> >
> >> >> > > > > >> > 0) Although this is not part of the public API, I think
> we
> >> >> > should
> >> >> > > > make
> >> >> > > > > >> sure
> >> >> > > > > >> > that the suggested pattern (i.e. user can always call
> >> >> abortTxn()
> >> >> > > > when
> >> >> > > > > >> > handling non-fatal errors) are indeed supported. E.g. if
> >> the
> >> >> txn
> >> >> > > is
> >> >> > > > > >> already
> >> >> > > > > >> > aborted by the broker side, then users can still call
> >> >> abortTxn
> >> >> > > which
> >> >> > > > > >> would
> >> >> > > > > >> > not throw another exception but just be treated as a
> >> no-op.
> >> >> > > > > >> >
> >> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think this error
> >> can
> >> >> > also
> >> >> > > be
> >> >> > > > > >> > returned but not documented yet. This should be a
> >> non-fatal
> >> >> > error.
> >> >> > > > > >> >
> >> >> > > > > >> > 2) *InvalidTxnStateException*: this error is returned
> from
> >> >> > broker
> >> >> > > > when
> >> >> > > > > >> txn
> >> >> > > > > >> > state transition failed (e.g. it is trying to transit to
> >> >> > > > > complete-commit
> >> >> > > > > >> > while the current state is not prepare-commit). This
> error
> >> >> could
> >> >> > > > > >> indicates
> >> >> > > > > >> > a bug on the client internal code or the broker code,
> OR a
> >> >> user
> >> >> > > > error
> >> >> > > > > >> --- a
> >> >> > > > > >> > similar error is ConcurrentTransactionsException, i.e.
> if
> >> >> Kafka
> >> >> > is
> >> >> > > > > >> bug-free
> >> >> > > > > >> > these exceptions would only be returned if users try to
> do
> >> >> > > something
> >> >> > > > > >> wrong,
> >> >> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So
> I'm
> >> >> > > thinking
> >> >> > > > it
> >> >> > > > > >> > should be a non-fatal error instead of a fatal error,
> >> wdyt?
> >> >> > > > > >> >
> >> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal
> transactional
> >> >> > > sequence
> >> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> >> >> > > > *OutOfSequenceException*
> >> >> > > > > >> that
> >> >> > > > > >> > is treated as non-fatal. I guess your proposal is that
> >> >> > > > > >> > OutOfOrderSequenceException would be treated either as
> >> fatal
> >> >> > with
> >> >> > > > > >> > transactional producer, or non-fatal with idempotent
> >> >> producer,
> >> >> > is
> >> >> > > > that
> >> >> > > > > >> > right? If the producer is only configured with
> idempotency
> >> >> but
> >> >> > not
> >> >> > > > > >> > transaction, then throwing a
> >> >> TransactionStateCorruptedException
> >> >> > > for
> >> >> > > > > >> > non-fatal errors would be confusing since users are not
> >> using
> >> >> > > > > >> transactions
> >> >> > > > > >> > at all.. So I suggest we always throw
> >> OutOfSequenceException
> >> >> > as-is
> >> >> > > > > (i.e.
> >> >> > > > > >> > not wrapped) no matter how the producer is configured,
> and
> >> >> let
> >> >> > the
> >> >> > > > > >> caller
> >> >> > > > > >> > decide how to handle it based on whether it is only
> >> >> idempotent
> >> >> > or
> >> >> > > > > >> > transactional itself.
> >> >> > > > > >> >
> >> >> > > > > >> > 4) Besides all the txn APIs, the `send()` callback /
> >> future
> >> >> can
> >> >> > > also
> >> >> > > > > >> throw
> >> >> > > > > >> > txn-related exceptions, I think this KIP should also
> cover
> >> >> this
> >> >> > > API
> >> >> > > > as
> >> >> > > > > >> well?
> >> >> > > > > >> >
> >> >> > > > > >> > 5) This is related to 1/2) above: sometimes those
> >> non-fatal
> >> >> > errors
> >> >> > > > > like
> >> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the
> state
> >> >> being
> >> >> > > > > >> corrupted
> >> >> > > > > >> > at the broker side, but maybe users are doing something
> >> >> wrong.
> >> >> > So
> >> >> > > > I'm
> >> >> > > > > >> > wondering if we should further distinguish those
> non-fatal
> >> >> > errors
> >> >> > > > > >> between
> >> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a broker
> >> timed
> >> >> > out
> >> >> > > > and
> >> >> > > > > >> > aborted a txn and later an endTxn request is received,
> >> and b)
> >> >> > the
> >> >> > > > > user's
> >> >> > > > > >> > API call pattern is incorrect, causing the request to be
> >> >> > rejected
> >> >> > > > with
> >> >> > > > > >> an
> >> >> > > > > >> > error code from the broker.
> >> >> *TransactionStateCorruptedException*
> >> >> > > > feels
> >> >> > > > > >> to
> >> >> > > > > >> > me more like for case a), but not case b).
> >> >> > > > > >> >
> >> >> > > > > >> >
> >> >> > > > > >> > Guozhang
> >> >> > > > > >> >
> >> >> > > > > >> >
> >> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> >> >> > > > > reluctanthero104@gmail.com>
> >> >> > > > > >> > wrote:
> >> >> > > > > >> >
> >> >> > > > > >> >> Thanks Matthias, I think your proposal makes sense as
> >> well,
> >> >> on
> >> >> > > the
> >> >> > > > > pro
> >> >> > > > > >> side
> >> >> > > > > >> >> we could have a universally agreed exception type to be
> >> >> caught
> >> >> > by
> >> >> > > > the
> >> >> > > > > >> user,
> >> >> > > > > >> >> without having an extra layer on top of the actual
> >> >> exceptions.
> >> >> > I
> >> >> > > > > could
> >> >> > > > > >> see
> >> >> > > > > >> >> some issue on downsides:
> >> >> > > > > >> >>
> >> >> > > > > >> >> 1. The exception hierarchy will be more complex. Right
> >> now
> >> >> we
> >> >> > > have
> >> >> > > > > >> >> RetriableException type, if we are going to add a
> >> >> > > > > >> >> `ProducerRetriableException` type, we have to put this
> >> new
> >> >> > > > interface
> >> >> > > > > >> as the
> >> >> > > > > >> >> parent of the RetriableException, because not all
> thrown
> >> >> > > non-fatal
> >> >> > > > > >> >> exceptions are `retriable` in general for producer, for
> >> >> example
> >> >> > > > > >> >> <
> >> >> > > > > >> >>
> >> >> > > > > >>
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> >> >> > > > > >> >>> .
> >> >> > > > > >> >> We could have an empty interface
> >> >> `ProducerRetriableException`
> >> >> > to
> >> >> > > > let
> >> >> > > > > >> all
> >> >> > > > > >> >> the thrown exceptions implement for sure, but it's a
> bit
> >> >> > > unorthodox
> >> >> > > > > in
> >> >> > > > > >> the
> >> >> > > > > >> >> way we deal with exceptions in general.
> >> >> > > > > >> >>
> >> >> > > > > >> >> 2. There are cases where we throw a KafkaException
> >> wrapping
> >> >> > > another
> >> >> > > > > >> >> KafkaException as either fatal or non-fatal. If we use
> an
> >> >> > > interface
> >> >> > > > > to
> >> >> > > > > >> >> solve #1, it is also required to implement another
> >> bloated
> >> >> > > > exception
> >> >> > > > > >> class
> >> >> > > > > >> >> which could replace KafkaException type, as we couldn't
> >> mark
> >> >> > > > > >> KafkaException
> >> >> > > > > >> >> as retriable for sure.
> >> >> > > > > >> >>
> >> >> > > > > >> >> 3. In terms of the encapsulation, wrapping means we
> could
> >> >> limit
> >> >> > > the
> >> >> > > > > >> scope
> >> >> > > > > >> >> of affection to the producer only, which is important
> >> since
> >> >> we
> >> >> > > > don't
> >> >> > > > > >> want
> >> >> > > > > >> >> shared exception types to implement a producer-related
> >> >> > interface,
> >> >> > > > > such
> >> >> > > > > >> >> as UnknownTopicOrPartitionException.
> >> >> > > > > >> >>
> >> >> > > > > >> >> Best,
> >> >> > > > > >> >> Boyang
> >> >> > > > > >> >>
> >> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> >> >> > mjsax@apache.org
> >> >> > > >
> >> >> > > > > >> wrote:
> >> >> > > > > >> >>
> >> >> > > > > >> >>> Thanks for the KIP Boyang!
> >> >> > > > > >> >>>
> >> >> > > > > >> >>> Overall, categorizing exceptions makes a lot of sense.
> >> As I
> >> >> > > don't
> >> >> > > > > know
> >> >> > > > > >> >>> the producer internals well enough, I cannot comment
> on
> >> the
> >> >> > > > > >> >>> categorization in detail though.
> >> >> > > > > >> >>>
> >> >> > > > > >> >>> What I am wondering is, if we should introduce an
> >> exception
> >> >> > > > > interface
> >> >> > > > > >> >>> that non-fatal exception would implement instead of
> >> >> creating a
> >> >> > > new
> >> >> > > > > >> class
> >> >> > > > > >> >>> that will wrap non-fatal exceptions? What would be the
> >> >> > pros/cons
> >> >> > > > for
> >> >> > > > > >> >>> both designs?
> >> >> > > > > >> >>>
> >> >> > > > > >> >>>
> >> >> > > > > >> >>> -Matthias
> >> >> > > > > >> >>>
> >> >> > > > > >> >>>
> >> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> >> >> > > > > >> >>>> Hey there,
> >> >> > > > > >> >>>>
> >> >> > > > > >> >>>> I would like to start a discussion thread for
> KIP-691:
> >> >> > > > > >> >>>>
> >> >> > > > > >> >>>
> >> >> > > > > >> >>
> >> >> > > > > >>
> >> >> > > > >
> >> >> > > >
> >> >> > >
> >> >> >
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> >> >> > > > > >> >>>>
> >> >> > > > > >> >>>> The KIP is aiming to simplify the exception handling
> >> logic
> >> >> > for
> >> >> > > > > >> >>>> transactional Producer users by classifying fatal and
> >> >> > non-fatal
> >> >> > > > > >> >>> exceptions
> >> >> > > > > >> >>>> and throw them correspondingly for easier catch and
> >> retry.
> >> >> > Let
> >> >> > > me
> >> >> > > > > >> know
> >> >> > > > > >> >>> what
> >> >> > > > > >> >>>> you think.
> >> >> > > > > >> >>>>
> >> >> > > > > >> >>>> Best,
> >> >> > > > > >> >>>> Boyang
> >> >> > > > > >> >>>>
> >> >> > > > > >> >>>
> >> >> > > > > >> >>
> >> >> > > > > >> >
> >> >> > > > > >> >
> >> >> > > > > >>
> >> >> > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > --
> >> >> > > > -- Guozhang
> >> >> > > >
> >> >> > >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > -- Guozhang
> >> >> >
> >> >>
> >> >
> >> >
> >> > --
> >> > -- Guozhang
> >> >
> >>
> >>
> >> --
> >> -- Guozhang
> >>
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks for the clarification Guozhang, I got your point that we want to
have a consistent handling of fatal exceptions being thrown from the
abortTxn. I modified the current template to move the fatal exception
try-catch outside of the processing loop to make sure we could get a chance
to close consumer/producer modules. Let me know what you think.

Best,
Boyang

On Fri, Jan 22, 2021 at 11:05 AM Boyang Chen <re...@gmail.com>
wrote:

> My understanding is that abortTransaction would only throw when the
> producer is in fatal state. For CommitFailed, the producer should still be
> in the abortable error state, so that abortTransaction call would not throw.
>
> On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <wa...@gmail.com> wrote:
>
>> Or are you going to maintain some internal state such that, the
>> `abortTransaction` in the catch block would never throw again?
>>
>> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <wa...@gmail.com>
>> wrote:
>>
>> > Hi Boyang/Jason,
>> >
>> > I've also thought about this (i.e. using CommitFailed for all
>> non-fatal),
>> > but what I'm pondering is that, in the catch (CommitFailed) block, what
>> > would happen if the `producer.abortTransaction();` throws again? should
>> > that be captured as a fatal and cause the client to close again.
>> >
>> > If yes, then naively the pattern would be:
>> >
>> > ...
>> > catch (CommitFailedException e) {
>> >         // Transaction commit failed with abortable error, user could
>> reset
>> >         // the application state and resume with a new transaction. The
>> > root
>> >         // cause was wrapped in the thrown exception.
>> >         resetToLastCommittedPositions(consumer);
>> >         try {
>> >             producer.abortTransaction();
>> >         } catch (KafkaException e) {
>> >             producer.close();
>> >             consumer.close();
>> >             throw e;
>> >         }
>> >     } catch (KafkaException e) {
>> >         producer.close();
>> >         consumer.close();
>> >         throw e;
>> >     }
>> > ...
>> >
>> > Guozhang
>> >
>> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <
>> reluctanthero104@gmail.com>
>> > wrote:
>> >
>> >> Hey Guozhang,
>> >>
>> >> Jason and I were discussing the new API offline and decided to take
>> >> another
>> >> approach. Firstly, the reason not to invent a new API with returned
>> >> boolean
>> >> flag is for compatibility consideration, since old EOS code would not
>> know
>> >> that a given transaction commit was failed internally as they don't
>> listen
>> >> to the output flag. Our proposed alternative solution is to let
>> >> *commitTransaction
>> >> throw CommitFailedException whenever the commit failed with non-fatal
>> >> exception*, so that on the caller side the handling logic becomes:
>> >>
>> >> try {
>> >>         if (shouldCommit) {
>> >>             producer.commitTransaction();
>> >>         } else {
>> >>             resetToLastCommittedPositions(consumer);
>> >>             producer.abortTransaction();
>> >>         }
>> >>     } catch (CommitFailedException e) {
>> >>         // Transaction commit failed with abortable error, user could
>> >> reset
>> >>         // the application state and resume with a new transaction. The
>> >> root
>> >>         // cause was wrapped in the thrown exception.
>> >>         resetToLastCommittedPositions(consumer);
>> >>         producer.abortTransaction();
>> >>     } catch (KafkaException e) {
>> >>         producer.close();
>> >>         consumer.close();
>> >>         throw e;
>> >>     }
>> >>
>> >> This approach looks cleaner as all exception types other than
>> CommitFailed
>> >> will doom to be fatal, which is very easy to adopt for users. In the
>> >> meantime, we still maintain the commitTxn behavior to throw instead of
>> >> silently failing.
>> >>
>> >> In addition, we decided to drop the recommendation to handle
>> >> TimeoutException and leave it to the users to make the call. The
>> downside
>> >> for blindly calling abortTxn upon timeout is that we could result in an
>> >> illegal state when the commit was already successful on the broker
>> >> side. Without a good guarantee on the outcome, overcomplicating the
>> >> template should not be encouraged IMHO.
>> >>
>> >> Let me know your thoughts on the new approach here, thank you!
>> >>
>> >> Best,
>> >> Boyang
>> >>
>> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com>
>> >> wrote:
>> >>
>> >> > Thanks for your clarification on 2)/3), that makes sense.
>> >> >
>> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
>> >> reluctanthero104@gmail.com>
>> >> > wrote:
>> >> >
>> >> > > Thanks for the input Guozhang, replied inline.
>> >> > >
>> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com>
>> >> > wrote:
>> >> > >
>> >> > > > Hello Boyang,
>> >> > > >
>> >> > > > Thanks for the updated KIP. I read it again and have the
>> following
>> >> > > > thoughts:
>> >> > > >
>> >> > > > 0. I'm a bit concerned that if commitTxn does not throw any
>> >> non-fatal
>> >> > > > exception, and instead we rely on the subsequent beginTxn call to
>> >> > throw,
>> >> > > it
>> >> > > > may violate some callers with a pattern that relying on
>> commitTxn to
>> >> > > > succeed to make some non-rollback operations. For example:
>> >> > > >
>> >> > > > beginTxn()
>> >> > > > // do some read-write on my local DB
>> >> > > > commitTxn()
>> >> > > > // if commitTxn succeeds, then commit the DB
>> >> > > >
>> >> > > > -------------
>> >> > > >
>> >> > > > The issue is that, committing DB is a non-rollback operation, and
>> >> users
>> >> > > may
>> >> > > > just rely on commitTxn to return without error to make this
>> >> > non-rollback
>> >> > > > call. Of course we can just claim this pattern is not legitimate
>> >> and is
>> >> > > not
>> >> > > > the right way of doing things, but many users may naturally adopt
>> >> this
>> >> > > > pattern.
>> >> > > >
>> >> > > > So maybe we should still let commitTxn also throw non-fatal
>> >> exceptions,
>> >> > > in
>> >> > > > which case we would then call abortTxn again.
>> >> > > >
>> >> > > > But if we do this, the pattern becomes:
>> >> > > >
>> >> > > > try {
>> >> > > >    beginTxn()
>> >> > > >    // do something
>> >> > > > } catch (Exception) {
>> >> > > >    shouldCommit = false;
>> >> > > > }
>> >> > > >
>> >> > > > if (shouldCommit) {
>> >> > > >     try {
>> >> > > >         commitTxn()
>> >> > > >     } catch (...) {        // enumerate all fatal exceptions
>> >> > > >         shutdown()
>> >> > > >     } catch (KafkaException) {
>> >> > > >         // non-fatal
>> >> > > >         shouldCommit = false;
>> >> > > >     }
>> >> > > > }
>> >> > > >
>> >> > > > if (!shouldCommit && running()) {
>> >> > > > try {
>> >> > > >         abortTxn()
>> >> > > >     } catch (KafkaException) {
>> >> > > >         // only throw fatal
>> >> > > >         shutdown()
>> >> > > >     }
>> >> > > > }
>> >> > > >
>> >> > > > ---------------------
>> >> > > >
>> >> > > > Which is much more complicated.
>> >> > > >
>> >> > > > Thank makes me think, the alternative we have discussed offline
>> may
>> >> be
>> >> > > > better: let commitTxn() to return a boolean flag.
>> >> > > >
>> >> > > > * If it returns true, it means the commit succeeded. Users can
>> >> > > comfortably
>> >> > > > continue and do any external non-rollback operations if they
>> like.
>> >> > > > * If it returns false, it means the commit failed with non-fatal
>> >> error
>> >> > > *AND
>> >> > > > the txn has been aborted*. Users do not need to call abortTxn
>> again.
>> >> > > > * If it throws, then it means fatal errors. Users should shut
>> down
>> >> the
>> >> > > > client.
>> >> > > >
>> >> > > > In this case, the pattern becomes:
>> >> > > >
>> >> > > > try {
>> >> > > >    beginTxn()
>> >> > > >    // do something
>> >> > > > } catch (Exception) {
>> >> > > >    shouldCommit = false;
>> >> > > > }
>> >> > > >
>> >> > > > try {
>> >> > > >     if (shouldCommit) {
>> >> > > >         commitSucceeded = commitTxn()
>> >> > > >     } else {
>> >> > > >         // reset offsets, rollback operations, etc
>> >> > > >         abortTxn()
>> >> > > >     }
>> >> > > > } catch (KafkaException) {
>> >> > > >     // only throw fatal
>> >> > > >     shutdown()
>> >> > > > }
>> >> > > >
>> >> > > > if (commitSucceeded)
>> >> > > >    // do other non-rollbackable things
>> >> > > > else
>> >> > > >    // reset offsets, rollback operations, etc
>> >> > > >
>> >> > > > -------------------------
>> >> > > >
>> >> > > > Of course, if we want to have better visibility into what caused
>> the
>> >> > > commit
>> >> > > > to fail and txn to abort. We can let the return type be an enum
>> >> instead
>> >> > > of
>> >> > > > a flag. But my main idea is to still let the commitTxn be the
>> final
>> >> > point
>> >> > > > users can tell whether this txn succeeded or not, instead of
>> >> relying on
>> >> > > the
>> >> > > > next beginTxn() call.
>> >> > > >
>> >> > > > I agree that adding a boolean flag is indeed useful in this case.
>> >> Will
>> >> > > update the KIP.
>> >> > >
>> >> > > 1. Re: "while maintaining the behavior to throw fatal exception in
>> >> raw"
>> >> > not
>> >> > > > sure what do you mean by "throw" here. Are you proposing the
>> >> callback
>> >> > > would
>> >> > > > *pass* (not throw) in any fatal exceptions when triggered without
>> >> > > wrapping?
>> >> > > >
>> >> > > > Yes, I want to say *pass*, the benefit is to make the end user's
>> >> > > expectation consistent
>> >> > > regarding exception handling.
>> >> > >
>> >> > >
>> >> > > > 2. I'm not sure I understand the claim regarding the callback
>> that
>> >> "In
>> >> > > EOS
>> >> > > > setup, it is not required to handle the exception". Are you
>> >> proposing
>> >> > > that,
>> >> > > > e.g. in Streams, we do not try to handle any exceptions if EOS is
>> >> > enabled
>> >> > > > in the callback anymore, but just let commitTxn() itself to fail
>> to
>> >> be
>> >> > > > notified about the problem? Personally I think in Streams we
>> should
>> >> > just
>> >> > > > make the handling logic of the callback to be consistent
>> regardless
>> >> of
>> >> > > the
>> >> > > > EOS settings (today we have different logic depending on this
>> logic,
>> >> > > which
>> >> > > > I think could be unified as well).
>> >> > > >
>> >> > > > My idea originates from the claim on send API:
>> >> > > "When used as part of a transaction, it is not necessary to define
>> a
>> >> > > callback or check the result of the future  in order to detect
>> errors
>> >> > from
>> >> > > <code>send</code>. "
>> >> > > My understanding is that for EOS, the exception will be detected by
>> >> > calling
>> >> > > transactional APIs either way, so it is a duplicate handling to
>> track
>> >> > > all the sendExceptions in RecordCollector. However, I looked up
>> >> > > sendException is being used today as follow:
>> >> > >
>> >> > > 1. Pass to "ProductionExceptionHandler" for any default or
>> customized
>> >> > > exception handler to handle
>> >> > > 2. Stop collecting offset info or new exceptions
>> >> > > 3. Check and rethrow exceptions in close(), flush() or new send()
>> >> calls
>> >> > >
>> >> > > To my understanding, we should still honor the commitment to #1 for
>> >> any
>> >> > > user customized implementation. The #2 does not really affect our
>> >> > decision
>> >> > > upon EOS. The #3 here is still valuable as it could help us fail
>> fast
>> >> in
>> >> > > new send() instead of waiting to later stage of processing. In that
>> >> > sense,
>> >> > > I agree we should continue to record send exceptions even under EOS
>> >> case
>> >> > to
>> >> > > ensure the strength of stream side Producer logic. On the safer
>> side,
>> >> we
>> >> > no
>> >> > > longer need to wrap certain fatal exceptions like ProducerFenced as
>> >> > > TaskMigrated, since they should not crash the stream thread if
>> thrown
>> >> in
>> >> > > raw format, once we adopt the new processing model in the send
>> phase.
>> >> > >
>> >> > >
>> >> > > >
>> >> > > > Guozhang
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
>> >> > reluctanthero104@gmail.com>
>> >> > > > wrote:
>> >> > > >
>> >> > > > > Thanks for everyone's feedback so far. I have polished the KIP
>> >> after
>> >> > > > > offline discussion with some folks working on EOS to make the
>> >> > exception
>> >> > > > > handling more lightweight. The essential change is that we are
>> not
>> >> > > > > inventing a new intermediate exception type, but instead
>> >> separating a
>> >> > > > full
>> >> > > > > transaction into two phases:
>> >> > > > >
>> >> > > > > 1. The data transmission phase
>> >> > > > > 2. The commit phase
>> >> > > > >
>> >> > > > > This way for any exception thrown from phase 1, will be an
>> >> indicator
>> >> > in
>> >> > > > > phase 2 whether we should do commit or abort, and from now on
>> >> > > > > `commitTransaction` should only throw fatal exceptions,
>> similar to
>> >> > > > > `abortTransaction`, so that any KafkaException caught in the
>> >> commit
>> >> > > phase
>> >> > > > > will be definitely fatal to crash the app. For more advanced
>> users
>> >> > such
>> >> > > > as
>> >> > > > > Streams, we have the ability to further wrap selected types of
>> >> fatal
>> >> > > > > exceptions to trigger task migration if necessary.
>> >> > > > >
>> >> > > > > More details in the KIP, feel free to take another look,
>> thanks!
>> >> > > > >
>> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
>> >> > > reluctanthero104@gmail.com>
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Thanks Bruno for the feedback.
>> >> > > > > >
>> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
>> >> bruno@confluent.io>
>> >> > > > wrote:
>> >> > > > > >
>> >> > > > > >> Thanks Boyang for the KIP!
>> >> > > > > >>
>> >> > > > > >> Like Matthias, I do also not know the producer internal well
>> >> > enough
>> >> > > to
>> >> > > > > >> comment on the categorization. However, I think having a
>> super
>> >> > > > exception
>> >> > > > > >> (e.g. RetriableException) that  encodes if an exception is
>> >> fatal
>> >> > or
>> >> > > > not
>> >> > > > > >> is cleaner, better understandable and less error-prone,
>> because
>> >> > > > ideally
>> >> > > > > >> when you add a new non-fatal exception in future you just
>> need
>> >> to
>> >> > > > think
>> >> > > > > >> about letting it inherit from the super exception and all
>> the
>> >> rest
>> >> > > of
>> >> > > > > >> the code will just behave correctly without the need to wrap
>> >> the
>> >> > new
>> >> > > > > >> exception into another exception each time it is thrown
>> (maybe
>> >> it
>> >> > is
>> >> > > > > >> thrown at different location in the code).
>> >> > > > > >>
>> >> > > > > >> As far as I understand the following statement from your
>> >> previous
>> >> > > > e-mail
>> >> > > > > >> is the reason that currently such a super exception is not
>> >> > possible:
>> >> > > > > >>
>> >> > > > > >> "Right now we have RetriableException type, if we are going
>> to
>> >> > add a
>> >> > > > > >> `ProducerRetriableException` type, we have to put this new
>> >> > interface
>> >> > > > as
>> >> > > > > >> the parent of the RetriableException, because not all thrown
>> >> > > non-fatal
>> >> > > > > >> exceptions are `retriable` in general for producer"
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> In the list of exceptions in your KIP, I found non-fatal
>> >> > exceptions
>> >> > > > that
>> >> > > > > >> do not inherit from RetriableException. I guess those are
>> the
>> >> ones
>> >> > > you
>> >> > > > > >> are referring to in your statement:
>> >> > > > > >>
>> >> > > > > >> InvalidProducerEpochException
>> >> > > > > >> InvalidPidMappingException
>> >> > > > > >> TransactionAbortedException
>> >> > > > > >>
>> >> > > > > >> All of those exceptions are non-fatal and do not inherit
>> from
>> >> > > > > >> RetriableException. Is there a reason for that? If they
>> >> depended
>> >> > > from
>> >> > > > > >> RetriableException we would be a bit closer to a super
>> >> exception I
>> >> > > > > >> mention above.
>> >> > > > > >>
>> >> > > > > >> The reason is that sender may catch those exceptions in the
>> >> > > > > > ProduceResponse, and it currently does infinite
>> >> > > > > > retries on RetriableException. To make sure we could trigger
>> the
>> >> > > > > > abortTransaction() by catching non-fatal thrown
>> >> > > > > > exceptions and reinitialize the txn state, we chose not to
>> let
>> >> > those
>> >> > > > > > exceptions inherit RetriableException so that
>> >> > > > > > they won't cause infinite retry on sender.
>> >> > > > > >
>> >> > > > > >
>> >> > > > > >> With OutOfOrderSequenceException and
>> >> UnknownProducerIdException, I
>> >> > > > think
>> >> > > > > >> to understand that their fatality depends on the type (i.e.
>> >> > > > > >> configuration) of the producer. That makes it difficult to
>> >> have a
>> >> > > > super
>> >> > > > > >> exception that encodes the retriability as mentioned above.
>> >> Would
>> >> > it
>> >> > > > be
>> >> > > > > >> possible to introduce exceptions that inherit from
>> >> > > RetriableException
>> >> > > > > >> and that are thrown when those exceptions are caught from
>> the
>> >> > > brokers
>> >> > > > > >> and the type of the producer is such that the exceptions are
>> >> > > > retriable?
>> >> > > > > >>
>> >> > > > > >> Yea, I think in general the exception type mixing is
>> difficult
>> >> to
>> >> > > get
>> >> > > > > > them all right. I have already proposed another solution
>> based
>> >> on
>> >> > my
>> >> > > > > > offline discussion with some folks working on EOS
>> >> > > > > > to make the handling more straightforward for end users
>> without
>> >> the
>> >> > > > need
>> >> > > > > > to distinguish exception fatality.
>> >> > > > > >
>> >> > > > > >> Best,
>> >> > > > > >> Bruno
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
>> >> > > > > >> > Thanks Boyang for the proposal! I made a pass over the
>> list
>> >> and
>> >> > > here
>> >> > > > > are
>> >> > > > > >> > some thoughts:
>> >> > > > > >> >
>> >> > > > > >> > 0) Although this is not part of the public API, I think we
>> >> > should
>> >> > > > make
>> >> > > > > >> sure
>> >> > > > > >> > that the suggested pattern (i.e. user can always call
>> >> abortTxn()
>> >> > > > when
>> >> > > > > >> > handling non-fatal errors) are indeed supported. E.g. if
>> the
>> >> txn
>> >> > > is
>> >> > > > > >> already
>> >> > > > > >> > aborted by the broker side, then users can still call
>> >> abortTxn
>> >> > > which
>> >> > > > > >> would
>> >> > > > > >> > not throw another exception but just be treated as a
>> no-op.
>> >> > > > > >> >
>> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think this error
>> can
>> >> > also
>> >> > > be
>> >> > > > > >> > returned but not documented yet. This should be a
>> non-fatal
>> >> > error.
>> >> > > > > >> >
>> >> > > > > >> > 2) *InvalidTxnStateException*: this error is returned from
>> >> > broker
>> >> > > > when
>> >> > > > > >> txn
>> >> > > > > >> > state transition failed (e.g. it is trying to transit to
>> >> > > > > complete-commit
>> >> > > > > >> > while the current state is not prepare-commit). This error
>> >> could
>> >> > > > > >> indicates
>> >> > > > > >> > a bug on the client internal code or the broker code, OR a
>> >> user
>> >> > > > error
>> >> > > > > >> --- a
>> >> > > > > >> > similar error is ConcurrentTransactionsException, i.e. if
>> >> Kafka
>> >> > is
>> >> > > > > >> bug-free
>> >> > > > > >> > these exceptions would only be returned if users try to do
>> >> > > something
>> >> > > > > >> wrong,
>> >> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
>> >> > > thinking
>> >> > > > it
>> >> > > > > >> > should be a non-fatal error instead of a fatal error,
>> wdyt?
>> >> > > > > >> >
>> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal transactional
>> >> > > sequence
>> >> > > > > >> > (Fatal)", this is a bit conflicting with the
>> >> > > > *OutOfSequenceException*
>> >> > > > > >> that
>> >> > > > > >> > is treated as non-fatal. I guess your proposal is that
>> >> > > > > >> > OutOfOrderSequenceException would be treated either as
>> fatal
>> >> > with
>> >> > > > > >> > transactional producer, or non-fatal with idempotent
>> >> producer,
>> >> > is
>> >> > > > that
>> >> > > > > >> > right? If the producer is only configured with idempotency
>> >> but
>> >> > not
>> >> > > > > >> > transaction, then throwing a
>> >> TransactionStateCorruptedException
>> >> > > for
>> >> > > > > >> > non-fatal errors would be confusing since users are not
>> using
>> >> > > > > >> transactions
>> >> > > > > >> > at all.. So I suggest we always throw
>> OutOfSequenceException
>> >> > as-is
>> >> > > > > (i.e.
>> >> > > > > >> > not wrapped) no matter how the producer is configured, and
>> >> let
>> >> > the
>> >> > > > > >> caller
>> >> > > > > >> > decide how to handle it based on whether it is only
>> >> idempotent
>> >> > or
>> >> > > > > >> > transactional itself.
>> >> > > > > >> >
>> >> > > > > >> > 4) Besides all the txn APIs, the `send()` callback /
>> future
>> >> can
>> >> > > also
>> >> > > > > >> throw
>> >> > > > > >> > txn-related exceptions, I think this KIP should also cover
>> >> this
>> >> > > API
>> >> > > > as
>> >> > > > > >> well?
>> >> > > > > >> >
>> >> > > > > >> > 5) This is related to 1/2) above: sometimes those
>> non-fatal
>> >> > errors
>> >> > > > > like
>> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the state
>> >> being
>> >> > > > > >> corrupted
>> >> > > > > >> > at the broker side, but maybe users are doing something
>> >> wrong.
>> >> > So
>> >> > > > I'm
>> >> > > > > >> > wondering if we should further distinguish those non-fatal
>> >> > errors
>> >> > > > > >> between
>> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a broker
>> timed
>> >> > out
>> >> > > > and
>> >> > > > > >> > aborted a txn and later an endTxn request is received,
>> and b)
>> >> > the
>> >> > > > > user's
>> >> > > > > >> > API call pattern is incorrect, causing the request to be
>> >> > rejected
>> >> > > > with
>> >> > > > > >> an
>> >> > > > > >> > error code from the broker.
>> >> *TransactionStateCorruptedException*
>> >> > > > feels
>> >> > > > > >> to
>> >> > > > > >> > me more like for case a), but not case b).
>> >> > > > > >> >
>> >> > > > > >> >
>> >> > > > > >> > Guozhang
>> >> > > > > >> >
>> >> > > > > >> >
>> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
>> >> > > > > reluctanthero104@gmail.com>
>> >> > > > > >> > wrote:
>> >> > > > > >> >
>> >> > > > > >> >> Thanks Matthias, I think your proposal makes sense as
>> well,
>> >> on
>> >> > > the
>> >> > > > > pro
>> >> > > > > >> side
>> >> > > > > >> >> we could have a universally agreed exception type to be
>> >> caught
>> >> > by
>> >> > > > the
>> >> > > > > >> user,
>> >> > > > > >> >> without having an extra layer on top of the actual
>> >> exceptions.
>> >> > I
>> >> > > > > could
>> >> > > > > >> see
>> >> > > > > >> >> some issue on downsides:
>> >> > > > > >> >>
>> >> > > > > >> >> 1. The exception hierarchy will be more complex. Right
>> now
>> >> we
>> >> > > have
>> >> > > > > >> >> RetriableException type, if we are going to add a
>> >> > > > > >> >> `ProducerRetriableException` type, we have to put this
>> new
>> >> > > > interface
>> >> > > > > >> as the
>> >> > > > > >> >> parent of the RetriableException, because not all thrown
>> >> > > non-fatal
>> >> > > > > >> >> exceptions are `retriable` in general for producer, for
>> >> example
>> >> > > > > >> >> <
>> >> > > > > >> >>
>> >> > > > > >>
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
>> >> > > > > >> >>> .
>> >> > > > > >> >> We could have an empty interface
>> >> `ProducerRetriableException`
>> >> > to
>> >> > > > let
>> >> > > > > >> all
>> >> > > > > >> >> the thrown exceptions implement for sure, but it's a bit
>> >> > > unorthodox
>> >> > > > > in
>> >> > > > > >> the
>> >> > > > > >> >> way we deal with exceptions in general.
>> >> > > > > >> >>
>> >> > > > > >> >> 2. There are cases where we throw a KafkaException
>> wrapping
>> >> > > another
>> >> > > > > >> >> KafkaException as either fatal or non-fatal. If we use an
>> >> > > interface
>> >> > > > > to
>> >> > > > > >> >> solve #1, it is also required to implement another
>> bloated
>> >> > > > exception
>> >> > > > > >> class
>> >> > > > > >> >> which could replace KafkaException type, as we couldn't
>> mark
>> >> > > > > >> KafkaException
>> >> > > > > >> >> as retriable for sure.
>> >> > > > > >> >>
>> >> > > > > >> >> 3. In terms of the encapsulation, wrapping means we could
>> >> limit
>> >> > > the
>> >> > > > > >> scope
>> >> > > > > >> >> of affection to the producer only, which is important
>> since
>> >> we
>> >> > > > don't
>> >> > > > > >> want
>> >> > > > > >> >> shared exception types to implement a producer-related
>> >> > interface,
>> >> > > > > such
>> >> > > > > >> >> as UnknownTopicOrPartitionException.
>> >> > > > > >> >>
>> >> > > > > >> >> Best,
>> >> > > > > >> >> Boyang
>> >> > > > > >> >>
>> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
>> >> > mjsax@apache.org
>> >> > > >
>> >> > > > > >> wrote:
>> >> > > > > >> >>
>> >> > > > > >> >>> Thanks for the KIP Boyang!
>> >> > > > > >> >>>
>> >> > > > > >> >>> Overall, categorizing exceptions makes a lot of sense.
>> As I
>> >> > > don't
>> >> > > > > know
>> >> > > > > >> >>> the producer internals well enough, I cannot comment on
>> the
>> >> > > > > >> >>> categorization in detail though.
>> >> > > > > >> >>>
>> >> > > > > >> >>> What I am wondering is, if we should introduce an
>> exception
>> >> > > > > interface
>> >> > > > > >> >>> that non-fatal exception would implement instead of
>> >> creating a
>> >> > > new
>> >> > > > > >> class
>> >> > > > > >> >>> that will wrap non-fatal exceptions? What would be the
>> >> > pros/cons
>> >> > > > for
>> >> > > > > >> >>> both designs?
>> >> > > > > >> >>>
>> >> > > > > >> >>>
>> >> > > > > >> >>> -Matthias
>> >> > > > > >> >>>
>> >> > > > > >> >>>
>> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
>> >> > > > > >> >>>> Hey there,
>> >> > > > > >> >>>>
>> >> > > > > >> >>>> I would like to start a discussion thread for KIP-691:
>> >> > > > > >> >>>>
>> >> > > > > >> >>>
>> >> > > > > >> >>
>> >> > > > > >>
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
>> >> > > > > >> >>>>
>> >> > > > > >> >>>> The KIP is aiming to simplify the exception handling
>> logic
>> >> > for
>> >> > > > > >> >>>> transactional Producer users by classifying fatal and
>> >> > non-fatal
>> >> > > > > >> >>> exceptions
>> >> > > > > >> >>>> and throw them correspondingly for easier catch and
>> retry.
>> >> > Let
>> >> > > me
>> >> > > > > >> know
>> >> > > > > >> >>> what
>> >> > > > > >> >>>> you think.
>> >> > > > > >> >>>>
>> >> > > > > >> >>>> Best,
>> >> > > > > >> >>>> Boyang
>> >> > > > > >> >>>>
>> >> > > > > >> >>>
>> >> > > > > >> >>
>> >> > > > > >> >
>> >> > > > > >> >
>> >> > > > > >>
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > > >
>> >> > > > --
>> >> > > > -- Guozhang
>> >> > > >
>> >> > >
>> >> >
>> >> >
>> >> > --
>> >> > -- Guozhang
>> >> >
>> >>
>> >
>> >
>> > --
>> > -- Guozhang
>> >
>>
>>
>> --
>> -- Guozhang
>>
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
My understanding is that abortTransaction would only throw when the
producer is in fatal state. For CommitFailed, the producer should still be
in the abortable error state, so that abortTransaction call would not throw.

On Fri, Jan 22, 2021 at 11:02 AM Guozhang Wang <wa...@gmail.com> wrote:

> Or are you going to maintain some internal state such that, the
> `abortTransaction` in the catch block would never throw again?
>
> On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hi Boyang/Jason,
> >
> > I've also thought about this (i.e. using CommitFailed for all non-fatal),
> > but what I'm pondering is that, in the catch (CommitFailed) block, what
> > would happen if the `producer.abortTransaction();` throws again? should
> > that be captured as a fatal and cause the client to close again.
> >
> > If yes, then naively the pattern would be:
> >
> > ...
> > catch (CommitFailedException e) {
> >         // Transaction commit failed with abortable error, user could
> reset
> >         // the application state and resume with a new transaction. The
> > root
> >         // cause was wrapped in the thrown exception.
> >         resetToLastCommittedPositions(consumer);
> >         try {
> >             producer.abortTransaction();
> >         } catch (KafkaException e) {
> >             producer.close();
> >             consumer.close();
> >             throw e;
> >         }
> >     } catch (KafkaException e) {
> >         producer.close();
> >         consumer.close();
> >         throw e;
> >     }
> > ...
> >
> > Guozhang
> >
> > On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <reluctanthero104@gmail.com
> >
> > wrote:
> >
> >> Hey Guozhang,
> >>
> >> Jason and I were discussing the new API offline and decided to take
> >> another
> >> approach. Firstly, the reason not to invent a new API with returned
> >> boolean
> >> flag is for compatibility consideration, since old EOS code would not
> know
> >> that a given transaction commit was failed internally as they don't
> listen
> >> to the output flag. Our proposed alternative solution is to let
> >> *commitTransaction
> >> throw CommitFailedException whenever the commit failed with non-fatal
> >> exception*, so that on the caller side the handling logic becomes:
> >>
> >> try {
> >>         if (shouldCommit) {
> >>             producer.commitTransaction();
> >>         } else {
> >>             resetToLastCommittedPositions(consumer);
> >>             producer.abortTransaction();
> >>         }
> >>     } catch (CommitFailedException e) {
> >>         // Transaction commit failed with abortable error, user could
> >> reset
> >>         // the application state and resume with a new transaction. The
> >> root
> >>         // cause was wrapped in the thrown exception.
> >>         resetToLastCommittedPositions(consumer);
> >>         producer.abortTransaction();
> >>     } catch (KafkaException e) {
> >>         producer.close();
> >>         consumer.close();
> >>         throw e;
> >>     }
> >>
> >> This approach looks cleaner as all exception types other than
> CommitFailed
> >> will doom to be fatal, which is very easy to adopt for users. In the
> >> meantime, we still maintain the commitTxn behavior to throw instead of
> >> silently failing.
> >>
> >> In addition, we decided to drop the recommendation to handle
> >> TimeoutException and leave it to the users to make the call. The
> downside
> >> for blindly calling abortTxn upon timeout is that we could result in an
> >> illegal state when the commit was already successful on the broker
> >> side. Without a good guarantee on the outcome, overcomplicating the
> >> template should not be encouraged IMHO.
> >>
> >> Let me know your thoughts on the new approach here, thank you!
> >>
> >> Best,
> >> Boyang
> >>
> >> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com>
> >> wrote:
> >>
> >> > Thanks for your clarification on 2)/3), that makes sense.
> >> >
> >> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
> >> reluctanthero104@gmail.com>
> >> > wrote:
> >> >
> >> > > Thanks for the input Guozhang, replied inline.
> >> > >
> >> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com>
> >> > wrote:
> >> > >
> >> > > > Hello Boyang,
> >> > > >
> >> > > > Thanks for the updated KIP. I read it again and have the following
> >> > > > thoughts:
> >> > > >
> >> > > > 0. I'm a bit concerned that if commitTxn does not throw any
> >> non-fatal
> >> > > > exception, and instead we rely on the subsequent beginTxn call to
> >> > throw,
> >> > > it
> >> > > > may violate some callers with a pattern that relying on commitTxn
> to
> >> > > > succeed to make some non-rollback operations. For example:
> >> > > >
> >> > > > beginTxn()
> >> > > > // do some read-write on my local DB
> >> > > > commitTxn()
> >> > > > // if commitTxn succeeds, then commit the DB
> >> > > >
> >> > > > -------------
> >> > > >
> >> > > > The issue is that, committing DB is a non-rollback operation, and
> >> users
> >> > > may
> >> > > > just rely on commitTxn to return without error to make this
> >> > non-rollback
> >> > > > call. Of course we can just claim this pattern is not legitimate
> >> and is
> >> > > not
> >> > > > the right way of doing things, but many users may naturally adopt
> >> this
> >> > > > pattern.
> >> > > >
> >> > > > So maybe we should still let commitTxn also throw non-fatal
> >> exceptions,
> >> > > in
> >> > > > which case we would then call abortTxn again.
> >> > > >
> >> > > > But if we do this, the pattern becomes:
> >> > > >
> >> > > > try {
> >> > > >    beginTxn()
> >> > > >    // do something
> >> > > > } catch (Exception) {
> >> > > >    shouldCommit = false;
> >> > > > }
> >> > > >
> >> > > > if (shouldCommit) {
> >> > > >     try {
> >> > > >         commitTxn()
> >> > > >     } catch (...) {        // enumerate all fatal exceptions
> >> > > >         shutdown()
> >> > > >     } catch (KafkaException) {
> >> > > >         // non-fatal
> >> > > >         shouldCommit = false;
> >> > > >     }
> >> > > > }
> >> > > >
> >> > > > if (!shouldCommit && running()) {
> >> > > > try {
> >> > > >         abortTxn()
> >> > > >     } catch (KafkaException) {
> >> > > >         // only throw fatal
> >> > > >         shutdown()
> >> > > >     }
> >> > > > }
> >> > > >
> >> > > > ---------------------
> >> > > >
> >> > > > Which is much more complicated.
> >> > > >
> >> > > > Thank makes me think, the alternative we have discussed offline
> may
> >> be
> >> > > > better: let commitTxn() to return a boolean flag.
> >> > > >
> >> > > > * If it returns true, it means the commit succeeded. Users can
> >> > > comfortably
> >> > > > continue and do any external non-rollback operations if they like.
> >> > > > * If it returns false, it means the commit failed with non-fatal
> >> error
> >> > > *AND
> >> > > > the txn has been aborted*. Users do not need to call abortTxn
> again.
> >> > > > * If it throws, then it means fatal errors. Users should shut down
> >> the
> >> > > > client.
> >> > > >
> >> > > > In this case, the pattern becomes:
> >> > > >
> >> > > > try {
> >> > > >    beginTxn()
> >> > > >    // do something
> >> > > > } catch (Exception) {
> >> > > >    shouldCommit = false;
> >> > > > }
> >> > > >
> >> > > > try {
> >> > > >     if (shouldCommit) {
> >> > > >         commitSucceeded = commitTxn()
> >> > > >     } else {
> >> > > >         // reset offsets, rollback operations, etc
> >> > > >         abortTxn()
> >> > > >     }
> >> > > > } catch (KafkaException) {
> >> > > >     // only throw fatal
> >> > > >     shutdown()
> >> > > > }
> >> > > >
> >> > > > if (commitSucceeded)
> >> > > >    // do other non-rollbackable things
> >> > > > else
> >> > > >    // reset offsets, rollback operations, etc
> >> > > >
> >> > > > -------------------------
> >> > > >
> >> > > > Of course, if we want to have better visibility into what caused
> the
> >> > > commit
> >> > > > to fail and txn to abort. We can let the return type be an enum
> >> instead
> >> > > of
> >> > > > a flag. But my main idea is to still let the commitTxn be the
> final
> >> > point
> >> > > > users can tell whether this txn succeeded or not, instead of
> >> relying on
> >> > > the
> >> > > > next beginTxn() call.
> >> > > >
> >> > > > I agree that adding a boolean flag is indeed useful in this case.
> >> Will
> >> > > update the KIP.
> >> > >
> >> > > 1. Re: "while maintaining the behavior to throw fatal exception in
> >> raw"
> >> > not
> >> > > > sure what do you mean by "throw" here. Are you proposing the
> >> callback
> >> > > would
> >> > > > *pass* (not throw) in any fatal exceptions when triggered without
> >> > > wrapping?
> >> > > >
> >> > > > Yes, I want to say *pass*, the benefit is to make the end user's
> >> > > expectation consistent
> >> > > regarding exception handling.
> >> > >
> >> > >
> >> > > > 2. I'm not sure I understand the claim regarding the callback that
> >> "In
> >> > > EOS
> >> > > > setup, it is not required to handle the exception". Are you
> >> proposing
> >> > > that,
> >> > > > e.g. in Streams, we do not try to handle any exceptions if EOS is
> >> > enabled
> >> > > > in the callback anymore, but just let commitTxn() itself to fail
> to
> >> be
> >> > > > notified about the problem? Personally I think in Streams we
> should
> >> > just
> >> > > > make the handling logic of the callback to be consistent
> regardless
> >> of
> >> > > the
> >> > > > EOS settings (today we have different logic depending on this
> logic,
> >> > > which
> >> > > > I think could be unified as well).
> >> > > >
> >> > > > My idea originates from the claim on send API:
> >> > > "When used as part of a transaction, it is not necessary to define a
> >> > > callback or check the result of the future  in order to detect
> errors
> >> > from
> >> > > <code>send</code>. "
> >> > > My understanding is that for EOS, the exception will be detected by
> >> > calling
> >> > > transactional APIs either way, so it is a duplicate handling to
> track
> >> > > all the sendExceptions in RecordCollector. However, I looked up
> >> > > sendException is being used today as follow:
> >> > >
> >> > > 1. Pass to "ProductionExceptionHandler" for any default or
> customized
> >> > > exception handler to handle
> >> > > 2. Stop collecting offset info or new exceptions
> >> > > 3. Check and rethrow exceptions in close(), flush() or new send()
> >> calls
> >> > >
> >> > > To my understanding, we should still honor the commitment to #1 for
> >> any
> >> > > user customized implementation. The #2 does not really affect our
> >> > decision
> >> > > upon EOS. The #3 here is still valuable as it could help us fail
> fast
> >> in
> >> > > new send() instead of waiting to later stage of processing. In that
> >> > sense,
> >> > > I agree we should continue to record send exceptions even under EOS
> >> case
> >> > to
> >> > > ensure the strength of stream side Producer logic. On the safer
> side,
> >> we
> >> > no
> >> > > longer need to wrap certain fatal exceptions like ProducerFenced as
> >> > > TaskMigrated, since they should not crash the stream thread if
> thrown
> >> in
> >> > > raw format, once we adopt the new processing model in the send
> phase.
> >> > >
> >> > >
> >> > > >
> >> > > > Guozhang
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> >> > reluctanthero104@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > Thanks for everyone's feedback so far. I have polished the KIP
> >> after
> >> > > > > offline discussion with some folks working on EOS to make the
> >> > exception
> >> > > > > handling more lightweight. The essential change is that we are
> not
> >> > > > > inventing a new intermediate exception type, but instead
> >> separating a
> >> > > > full
> >> > > > > transaction into two phases:
> >> > > > >
> >> > > > > 1. The data transmission phase
> >> > > > > 2. The commit phase
> >> > > > >
> >> > > > > This way for any exception thrown from phase 1, will be an
> >> indicator
> >> > in
> >> > > > > phase 2 whether we should do commit or abort, and from now on
> >> > > > > `commitTransaction` should only throw fatal exceptions, similar
> to
> >> > > > > `abortTransaction`, so that any KafkaException caught in the
> >> commit
> >> > > phase
> >> > > > > will be definitely fatal to crash the app. For more advanced
> users
> >> > such
> >> > > > as
> >> > > > > Streams, we have the ability to further wrap selected types of
> >> fatal
> >> > > > > exceptions to trigger task migration if necessary.
> >> > > > >
> >> > > > > More details in the KIP, feel free to take another look, thanks!
> >> > > > >
> >> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> >> > > reluctanthero104@gmail.com>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Thanks Bruno for the feedback.
> >> > > > > >
> >> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
> >> bruno@confluent.io>
> >> > > > wrote:
> >> > > > > >
> >> > > > > >> Thanks Boyang for the KIP!
> >> > > > > >>
> >> > > > > >> Like Matthias, I do also not know the producer internal well
> >> > enough
> >> > > to
> >> > > > > >> comment on the categorization. However, I think having a
> super
> >> > > > exception
> >> > > > > >> (e.g. RetriableException) that  encodes if an exception is
> >> fatal
> >> > or
> >> > > > not
> >> > > > > >> is cleaner, better understandable and less error-prone,
> because
> >> > > > ideally
> >> > > > > >> when you add a new non-fatal exception in future you just
> need
> >> to
> >> > > > think
> >> > > > > >> about letting it inherit from the super exception and all the
> >> rest
> >> > > of
> >> > > > > >> the code will just behave correctly without the need to wrap
> >> the
> >> > new
> >> > > > > >> exception into another exception each time it is thrown
> (maybe
> >> it
> >> > is
> >> > > > > >> thrown at different location in the code).
> >> > > > > >>
> >> > > > > >> As far as I understand the following statement from your
> >> previous
> >> > > > e-mail
> >> > > > > >> is the reason that currently such a super exception is not
> >> > possible:
> >> > > > > >>
> >> > > > > >> "Right now we have RetriableException type, if we are going
> to
> >> > add a
> >> > > > > >> `ProducerRetriableException` type, we have to put this new
> >> > interface
> >> > > > as
> >> > > > > >> the parent of the RetriableException, because not all thrown
> >> > > non-fatal
> >> > > > > >> exceptions are `retriable` in general for producer"
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> In the list of exceptions in your KIP, I found non-fatal
> >> > exceptions
> >> > > > that
> >> > > > > >> do not inherit from RetriableException. I guess those are the
> >> ones
> >> > > you
> >> > > > > >> are referring to in your statement:
> >> > > > > >>
> >> > > > > >> InvalidProducerEpochException
> >> > > > > >> InvalidPidMappingException
> >> > > > > >> TransactionAbortedException
> >> > > > > >>
> >> > > > > >> All of those exceptions are non-fatal and do not inherit from
> >> > > > > >> RetriableException. Is there a reason for that? If they
> >> depended
> >> > > from
> >> > > > > >> RetriableException we would be a bit closer to a super
> >> exception I
> >> > > > > >> mention above.
> >> > > > > >>
> >> > > > > >> The reason is that sender may catch those exceptions in the
> >> > > > > > ProduceResponse, and it currently does infinite
> >> > > > > > retries on RetriableException. To make sure we could trigger
> the
> >> > > > > > abortTransaction() by catching non-fatal thrown
> >> > > > > > exceptions and reinitialize the txn state, we chose not to let
> >> > those
> >> > > > > > exceptions inherit RetriableException so that
> >> > > > > > they won't cause infinite retry on sender.
> >> > > > > >
> >> > > > > >
> >> > > > > >> With OutOfOrderSequenceException and
> >> UnknownProducerIdException, I
> >> > > > think
> >> > > > > >> to understand that their fatality depends on the type (i.e.
> >> > > > > >> configuration) of the producer. That makes it difficult to
> >> have a
> >> > > > super
> >> > > > > >> exception that encodes the retriability as mentioned above.
> >> Would
> >> > it
> >> > > > be
> >> > > > > >> possible to introduce exceptions that inherit from
> >> > > RetriableException
> >> > > > > >> and that are thrown when those exceptions are caught from the
> >> > > brokers
> >> > > > > >> and the type of the producer is such that the exceptions are
> >> > > > retriable?
> >> > > > > >>
> >> > > > > >> Yea, I think in general the exception type mixing is
> difficult
> >> to
> >> > > get
> >> > > > > > them all right. I have already proposed another solution based
> >> on
> >> > my
> >> > > > > > offline discussion with some folks working on EOS
> >> > > > > > to make the handling more straightforward for end users
> without
> >> the
> >> > > > need
> >> > > > > > to distinguish exception fatality.
> >> > > > > >
> >> > > > > >> Best,
> >> > > > > >> Bruno
> >> > > > > >>
> >> > > > > >>
> >> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> >> > > > > >> > Thanks Boyang for the proposal! I made a pass over the list
> >> and
> >> > > here
> >> > > > > are
> >> > > > > >> > some thoughts:
> >> > > > > >> >
> >> > > > > >> > 0) Although this is not part of the public API, I think we
> >> > should
> >> > > > make
> >> > > > > >> sure
> >> > > > > >> > that the suggested pattern (i.e. user can always call
> >> abortTxn()
> >> > > > when
> >> > > > > >> > handling non-fatal errors) are indeed supported. E.g. if
> the
> >> txn
> >> > > is
> >> > > > > >> already
> >> > > > > >> > aborted by the broker side, then users can still call
> >> abortTxn
> >> > > which
> >> > > > > >> would
> >> > > > > >> > not throw another exception but just be treated as a no-op.
> >> > > > > >> >
> >> > > > > >> > 1) *ConcurrentTransactionsException*: I think this error
> can
> >> > also
> >> > > be
> >> > > > > >> > returned but not documented yet. This should be a non-fatal
> >> > error.
> >> > > > > >> >
> >> > > > > >> > 2) *InvalidTxnStateException*: this error is returned from
> >> > broker
> >> > > > when
> >> > > > > >> txn
> >> > > > > >> > state transition failed (e.g. it is trying to transit to
> >> > > > > complete-commit
> >> > > > > >> > while the current state is not prepare-commit). This error
> >> could
> >> > > > > >> indicates
> >> > > > > >> > a bug on the client internal code or the broker code, OR a
> >> user
> >> > > > error
> >> > > > > >> --- a
> >> > > > > >> > similar error is ConcurrentTransactionsException, i.e. if
> >> Kafka
> >> > is
> >> > > > > >> bug-free
> >> > > > > >> > these exceptions would only be returned if users try to do
> >> > > something
> >> > > > > >> wrong,
> >> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
> >> > > thinking
> >> > > > it
> >> > > > > >> > should be a non-fatal error instead of a fatal error, wdyt?
> >> > > > > >> >
> >> > > > > >> > 3) *KafkaException*: case i "indicates fatal transactional
> >> > > sequence
> >> > > > > >> > (Fatal)", this is a bit conflicting with the
> >> > > > *OutOfSequenceException*
> >> > > > > >> that
> >> > > > > >> > is treated as non-fatal. I guess your proposal is that
> >> > > > > >> > OutOfOrderSequenceException would be treated either as
> fatal
> >> > with
> >> > > > > >> > transactional producer, or non-fatal with idempotent
> >> producer,
> >> > is
> >> > > > that
> >> > > > > >> > right? If the producer is only configured with idempotency
> >> but
> >> > not
> >> > > > > >> > transaction, then throwing a
> >> TransactionStateCorruptedException
> >> > > for
> >> > > > > >> > non-fatal errors would be confusing since users are not
> using
> >> > > > > >> transactions
> >> > > > > >> > at all.. So I suggest we always throw
> OutOfSequenceException
> >> > as-is
> >> > > > > (i.e.
> >> > > > > >> > not wrapped) no matter how the producer is configured, and
> >> let
> >> > the
> >> > > > > >> caller
> >> > > > > >> > decide how to handle it based on whether it is only
> >> idempotent
> >> > or
> >> > > > > >> > transactional itself.
> >> > > > > >> >
> >> > > > > >> > 4) Besides all the txn APIs, the `send()` callback / future
> >> can
> >> > > also
> >> > > > > >> throw
> >> > > > > >> > txn-related exceptions, I think this KIP should also cover
> >> this
> >> > > API
> >> > > > as
> >> > > > > >> well?
> >> > > > > >> >
> >> > > > > >> > 5) This is related to 1/2) above: sometimes those non-fatal
> >> > errors
> >> > > > > like
> >> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the state
> >> being
> >> > > > > >> corrupted
> >> > > > > >> > at the broker side, but maybe users are doing something
> >> wrong.
> >> > So
> >> > > > I'm
> >> > > > > >> > wondering if we should further distinguish those non-fatal
> >> > errors
> >> > > > > >> between
> >> > > > > >> > a) those that are caused by Kafka itself, e.g. a broker
> timed
> >> > out
> >> > > > and
> >> > > > > >> > aborted a txn and later an endTxn request is received, and
> b)
> >> > the
> >> > > > > user's
> >> > > > > >> > API call pattern is incorrect, causing the request to be
> >> > rejected
> >> > > > with
> >> > > > > >> an
> >> > > > > >> > error code from the broker.
> >> *TransactionStateCorruptedException*
> >> > > > feels
> >> > > > > >> to
> >> > > > > >> > me more like for case a), but not case b).
> >> > > > > >> >
> >> > > > > >> >
> >> > > > > >> > Guozhang
> >> > > > > >> >
> >> > > > > >> >
> >> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> >> > > > > reluctanthero104@gmail.com>
> >> > > > > >> > wrote:
> >> > > > > >> >
> >> > > > > >> >> Thanks Matthias, I think your proposal makes sense as
> well,
> >> on
> >> > > the
> >> > > > > pro
> >> > > > > >> side
> >> > > > > >> >> we could have a universally agreed exception type to be
> >> caught
> >> > by
> >> > > > the
> >> > > > > >> user,
> >> > > > > >> >> without having an extra layer on top of the actual
> >> exceptions.
> >> > I
> >> > > > > could
> >> > > > > >> see
> >> > > > > >> >> some issue on downsides:
> >> > > > > >> >>
> >> > > > > >> >> 1. The exception hierarchy will be more complex. Right now
> >> we
> >> > > have
> >> > > > > >> >> RetriableException type, if we are going to add a
> >> > > > > >> >> `ProducerRetriableException` type, we have to put this new
> >> > > > interface
> >> > > > > >> as the
> >> > > > > >> >> parent of the RetriableException, because not all thrown
> >> > > non-fatal
> >> > > > > >> >> exceptions are `retriable` in general for producer, for
> >> example
> >> > > > > >> >> <
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> >> > > > > >> >>> .
> >> > > > > >> >> We could have an empty interface
> >> `ProducerRetriableException`
> >> > to
> >> > > > let
> >> > > > > >> all
> >> > > > > >> >> the thrown exceptions implement for sure, but it's a bit
> >> > > unorthodox
> >> > > > > in
> >> > > > > >> the
> >> > > > > >> >> way we deal with exceptions in general.
> >> > > > > >> >>
> >> > > > > >> >> 2. There are cases where we throw a KafkaException
> wrapping
> >> > > another
> >> > > > > >> >> KafkaException as either fatal or non-fatal. If we use an
> >> > > interface
> >> > > > > to
> >> > > > > >> >> solve #1, it is also required to implement another bloated
> >> > > > exception
> >> > > > > >> class
> >> > > > > >> >> which could replace KafkaException type, as we couldn't
> mark
> >> > > > > >> KafkaException
> >> > > > > >> >> as retriable for sure.
> >> > > > > >> >>
> >> > > > > >> >> 3. In terms of the encapsulation, wrapping means we could
> >> limit
> >> > > the
> >> > > > > >> scope
> >> > > > > >> >> of affection to the producer only, which is important
> since
> >> we
> >> > > > don't
> >> > > > > >> want
> >> > > > > >> >> shared exception types to implement a producer-related
> >> > interface,
> >> > > > > such
> >> > > > > >> >> as UnknownTopicOrPartitionException.
> >> > > > > >> >>
> >> > > > > >> >> Best,
> >> > > > > >> >> Boyang
> >> > > > > >> >>
> >> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> >> > mjsax@apache.org
> >> > > >
> >> > > > > >> wrote:
> >> > > > > >> >>
> >> > > > > >> >>> Thanks for the KIP Boyang!
> >> > > > > >> >>>
> >> > > > > >> >>> Overall, categorizing exceptions makes a lot of sense.
> As I
> >> > > don't
> >> > > > > know
> >> > > > > >> >>> the producer internals well enough, I cannot comment on
> the
> >> > > > > >> >>> categorization in detail though.
> >> > > > > >> >>>
> >> > > > > >> >>> What I am wondering is, if we should introduce an
> exception
> >> > > > > interface
> >> > > > > >> >>> that non-fatal exception would implement instead of
> >> creating a
> >> > > new
> >> > > > > >> class
> >> > > > > >> >>> that will wrap non-fatal exceptions? What would be the
> >> > pros/cons
> >> > > > for
> >> > > > > >> >>> both designs?
> >> > > > > >> >>>
> >> > > > > >> >>>
> >> > > > > >> >>> -Matthias
> >> > > > > >> >>>
> >> > > > > >> >>>
> >> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> >> > > > > >> >>>> Hey there,
> >> > > > > >> >>>>
> >> > > > > >> >>>> I would like to start a discussion thread for KIP-691:
> >> > > > > >> >>>>
> >> > > > > >> >>>
> >> > > > > >> >>
> >> > > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> >> > > > > >> >>>>
> >> > > > > >> >>>> The KIP is aiming to simplify the exception handling
> logic
> >> > for
> >> > > > > >> >>>> transactional Producer users by classifying fatal and
> >> > non-fatal
> >> > > > > >> >>> exceptions
> >> > > > > >> >>>> and throw them correspondingly for easier catch and
> retry.
> >> > Let
> >> > > me
> >> > > > > >> know
> >> > > > > >> >>> what
> >> > > > > >> >>>> you think.
> >> > > > > >> >>>>
> >> > > > > >> >>>> Best,
> >> > > > > >> >>>> Boyang
> >> > > > > >> >>>>
> >> > > > > >> >>>
> >> > > > > >> >>
> >> > > > > >> >
> >> > > > > >> >
> >> > > > > >>
> >> > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > > -- Guozhang
> >> > > >
> >> > >
> >> >
> >> >
> >> > --
> >> > -- Guozhang
> >> >
> >>
> >
> >
> > --
> > -- Guozhang
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Or are you going to maintain some internal state such that, the
`abortTransaction` in the catch block would never throw again?

On Fri, Jan 22, 2021 at 11:01 AM Guozhang Wang <wa...@gmail.com> wrote:

> Hi Boyang/Jason,
>
> I've also thought about this (i.e. using CommitFailed for all non-fatal),
> but what I'm pondering is that, in the catch (CommitFailed) block, what
> would happen if the `producer.abortTransaction();` throws again? should
> that be captured as a fatal and cause the client to close again.
>
> If yes, then naively the pattern would be:
>
> ...
> catch (CommitFailedException e) {
>         // Transaction commit failed with abortable error, user could reset
>         // the application state and resume with a new transaction. The
> root
>         // cause was wrapped in the thrown exception.
>         resetToLastCommittedPositions(consumer);
>         try {
>             producer.abortTransaction();
>         } catch (KafkaException e) {
>             producer.close();
>             consumer.close();
>             throw e;
>         }
>     } catch (KafkaException e) {
>         producer.close();
>         consumer.close();
>         throw e;
>     }
> ...
>
> Guozhang
>
> On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <re...@gmail.com>
> wrote:
>
>> Hey Guozhang,
>>
>> Jason and I were discussing the new API offline and decided to take
>> another
>> approach. Firstly, the reason not to invent a new API with returned
>> boolean
>> flag is for compatibility consideration, since old EOS code would not know
>> that a given transaction commit was failed internally as they don't listen
>> to the output flag. Our proposed alternative solution is to let
>> *commitTransaction
>> throw CommitFailedException whenever the commit failed with non-fatal
>> exception*, so that on the caller side the handling logic becomes:
>>
>> try {
>>         if (shouldCommit) {
>>             producer.commitTransaction();
>>         } else {
>>             resetToLastCommittedPositions(consumer);
>>             producer.abortTransaction();
>>         }
>>     } catch (CommitFailedException e) {
>>         // Transaction commit failed with abortable error, user could
>> reset
>>         // the application state and resume with a new transaction. The
>> root
>>         // cause was wrapped in the thrown exception.
>>         resetToLastCommittedPositions(consumer);
>>         producer.abortTransaction();
>>     } catch (KafkaException e) {
>>         producer.close();
>>         consumer.close();
>>         throw e;
>>     }
>>
>> This approach looks cleaner as all exception types other than CommitFailed
>> will doom to be fatal, which is very easy to adopt for users. In the
>> meantime, we still maintain the commitTxn behavior to throw instead of
>> silently failing.
>>
>> In addition, we decided to drop the recommendation to handle
>> TimeoutException and leave it to the users to make the call. The downside
>> for blindly calling abortTxn upon timeout is that we could result in an
>> illegal state when the commit was already successful on the broker
>> side. Without a good guarantee on the outcome, overcomplicating the
>> template should not be encouraged IMHO.
>>
>> Let me know your thoughts on the new approach here, thank you!
>>
>> Best,
>> Boyang
>>
>> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com>
>> wrote:
>>
>> > Thanks for your clarification on 2)/3), that makes sense.
>> >
>> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <
>> reluctanthero104@gmail.com>
>> > wrote:
>> >
>> > > Thanks for the input Guozhang, replied inline.
>> > >
>> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com>
>> > wrote:
>> > >
>> > > > Hello Boyang,
>> > > >
>> > > > Thanks for the updated KIP. I read it again and have the following
>> > > > thoughts:
>> > > >
>> > > > 0. I'm a bit concerned that if commitTxn does not throw any
>> non-fatal
>> > > > exception, and instead we rely on the subsequent beginTxn call to
>> > throw,
>> > > it
>> > > > may violate some callers with a pattern that relying on commitTxn to
>> > > > succeed to make some non-rollback operations. For example:
>> > > >
>> > > > beginTxn()
>> > > > // do some read-write on my local DB
>> > > > commitTxn()
>> > > > // if commitTxn succeeds, then commit the DB
>> > > >
>> > > > -------------
>> > > >
>> > > > The issue is that, committing DB is a non-rollback operation, and
>> users
>> > > may
>> > > > just rely on commitTxn to return without error to make this
>> > non-rollback
>> > > > call. Of course we can just claim this pattern is not legitimate
>> and is
>> > > not
>> > > > the right way of doing things, but many users may naturally adopt
>> this
>> > > > pattern.
>> > > >
>> > > > So maybe we should still let commitTxn also throw non-fatal
>> exceptions,
>> > > in
>> > > > which case we would then call abortTxn again.
>> > > >
>> > > > But if we do this, the pattern becomes:
>> > > >
>> > > > try {
>> > > >    beginTxn()
>> > > >    // do something
>> > > > } catch (Exception) {
>> > > >    shouldCommit = false;
>> > > > }
>> > > >
>> > > > if (shouldCommit) {
>> > > >     try {
>> > > >         commitTxn()
>> > > >     } catch (...) {        // enumerate all fatal exceptions
>> > > >         shutdown()
>> > > >     } catch (KafkaException) {
>> > > >         // non-fatal
>> > > >         shouldCommit = false;
>> > > >     }
>> > > > }
>> > > >
>> > > > if (!shouldCommit && running()) {
>> > > > try {
>> > > >         abortTxn()
>> > > >     } catch (KafkaException) {
>> > > >         // only throw fatal
>> > > >         shutdown()
>> > > >     }
>> > > > }
>> > > >
>> > > > ---------------------
>> > > >
>> > > > Which is much more complicated.
>> > > >
>> > > > Thank makes me think, the alternative we have discussed offline may
>> be
>> > > > better: let commitTxn() to return a boolean flag.
>> > > >
>> > > > * If it returns true, it means the commit succeeded. Users can
>> > > comfortably
>> > > > continue and do any external non-rollback operations if they like.
>> > > > * If it returns false, it means the commit failed with non-fatal
>> error
>> > > *AND
>> > > > the txn has been aborted*. Users do not need to call abortTxn again.
>> > > > * If it throws, then it means fatal errors. Users should shut down
>> the
>> > > > client.
>> > > >
>> > > > In this case, the pattern becomes:
>> > > >
>> > > > try {
>> > > >    beginTxn()
>> > > >    // do something
>> > > > } catch (Exception) {
>> > > >    shouldCommit = false;
>> > > > }
>> > > >
>> > > > try {
>> > > >     if (shouldCommit) {
>> > > >         commitSucceeded = commitTxn()
>> > > >     } else {
>> > > >         // reset offsets, rollback operations, etc
>> > > >         abortTxn()
>> > > >     }
>> > > > } catch (KafkaException) {
>> > > >     // only throw fatal
>> > > >     shutdown()
>> > > > }
>> > > >
>> > > > if (commitSucceeded)
>> > > >    // do other non-rollbackable things
>> > > > else
>> > > >    // reset offsets, rollback operations, etc
>> > > >
>> > > > -------------------------
>> > > >
>> > > > Of course, if we want to have better visibility into what caused the
>> > > commit
>> > > > to fail and txn to abort. We can let the return type be an enum
>> instead
>> > > of
>> > > > a flag. But my main idea is to still let the commitTxn be the final
>> > point
>> > > > users can tell whether this txn succeeded or not, instead of
>> relying on
>> > > the
>> > > > next beginTxn() call.
>> > > >
>> > > > I agree that adding a boolean flag is indeed useful in this case.
>> Will
>> > > update the KIP.
>> > >
>> > > 1. Re: "while maintaining the behavior to throw fatal exception in
>> raw"
>> > not
>> > > > sure what do you mean by "throw" here. Are you proposing the
>> callback
>> > > would
>> > > > *pass* (not throw) in any fatal exceptions when triggered without
>> > > wrapping?
>> > > >
>> > > > Yes, I want to say *pass*, the benefit is to make the end user's
>> > > expectation consistent
>> > > regarding exception handling.
>> > >
>> > >
>> > > > 2. I'm not sure I understand the claim regarding the callback that
>> "In
>> > > EOS
>> > > > setup, it is not required to handle the exception". Are you
>> proposing
>> > > that,
>> > > > e.g. in Streams, we do not try to handle any exceptions if EOS is
>> > enabled
>> > > > in the callback anymore, but just let commitTxn() itself to fail to
>> be
>> > > > notified about the problem? Personally I think in Streams we should
>> > just
>> > > > make the handling logic of the callback to be consistent regardless
>> of
>> > > the
>> > > > EOS settings (today we have different logic depending on this logic,
>> > > which
>> > > > I think could be unified as well).
>> > > >
>> > > > My idea originates from the claim on send API:
>> > > "When used as part of a transaction, it is not necessary to define a
>> > > callback or check the result of the future  in order to detect errors
>> > from
>> > > <code>send</code>. "
>> > > My understanding is that for EOS, the exception will be detected by
>> > calling
>> > > transactional APIs either way, so it is a duplicate handling to track
>> > > all the sendExceptions in RecordCollector. However, I looked up
>> > > sendException is being used today as follow:
>> > >
>> > > 1. Pass to "ProductionExceptionHandler" for any default or customized
>> > > exception handler to handle
>> > > 2. Stop collecting offset info or new exceptions
>> > > 3. Check and rethrow exceptions in close(), flush() or new send()
>> calls
>> > >
>> > > To my understanding, we should still honor the commitment to #1 for
>> any
>> > > user customized implementation. The #2 does not really affect our
>> > decision
>> > > upon EOS. The #3 here is still valuable as it could help us fail fast
>> in
>> > > new send() instead of waiting to later stage of processing. In that
>> > sense,
>> > > I agree we should continue to record send exceptions even under EOS
>> case
>> > to
>> > > ensure the strength of stream side Producer logic. On the safer side,
>> we
>> > no
>> > > longer need to wrap certain fatal exceptions like ProducerFenced as
>> > > TaskMigrated, since they should not crash the stream thread if thrown
>> in
>> > > raw format, once we adopt the new processing model in the send phase.
>> > >
>> > >
>> > > >
>> > > > Guozhang
>> > > >
>> > > >
>> > > >
>> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
>> > reluctanthero104@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Thanks for everyone's feedback so far. I have polished the KIP
>> after
>> > > > > offline discussion with some folks working on EOS to make the
>> > exception
>> > > > > handling more lightweight. The essential change is that we are not
>> > > > > inventing a new intermediate exception type, but instead
>> separating a
>> > > > full
>> > > > > transaction into two phases:
>> > > > >
>> > > > > 1. The data transmission phase
>> > > > > 2. The commit phase
>> > > > >
>> > > > > This way for any exception thrown from phase 1, will be an
>> indicator
>> > in
>> > > > > phase 2 whether we should do commit or abort, and from now on
>> > > > > `commitTransaction` should only throw fatal exceptions, similar to
>> > > > > `abortTransaction`, so that any KafkaException caught in the
>> commit
>> > > phase
>> > > > > will be definitely fatal to crash the app. For more advanced users
>> > such
>> > > > as
>> > > > > Streams, we have the ability to further wrap selected types of
>> fatal
>> > > > > exceptions to trigger task migration if necessary.
>> > > > >
>> > > > > More details in the KIP, feel free to take another look, thanks!
>> > > > >
>> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
>> > > reluctanthero104@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Thanks Bruno for the feedback.
>> > > > > >
>> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <
>> bruno@confluent.io>
>> > > > wrote:
>> > > > > >
>> > > > > >> Thanks Boyang for the KIP!
>> > > > > >>
>> > > > > >> Like Matthias, I do also not know the producer internal well
>> > enough
>> > > to
>> > > > > >> comment on the categorization. However, I think having a super
>> > > > exception
>> > > > > >> (e.g. RetriableException) that  encodes if an exception is
>> fatal
>> > or
>> > > > not
>> > > > > >> is cleaner, better understandable and less error-prone, because
>> > > > ideally
>> > > > > >> when you add a new non-fatal exception in future you just need
>> to
>> > > > think
>> > > > > >> about letting it inherit from the super exception and all the
>> rest
>> > > of
>> > > > > >> the code will just behave correctly without the need to wrap
>> the
>> > new
>> > > > > >> exception into another exception each time it is thrown (maybe
>> it
>> > is
>> > > > > >> thrown at different location in the code).
>> > > > > >>
>> > > > > >> As far as I understand the following statement from your
>> previous
>> > > > e-mail
>> > > > > >> is the reason that currently such a super exception is not
>> > possible:
>> > > > > >>
>> > > > > >> "Right now we have RetriableException type, if we are going to
>> > add a
>> > > > > >> `ProducerRetriableException` type, we have to put this new
>> > interface
>> > > > as
>> > > > > >> the parent of the RetriableException, because not all thrown
>> > > non-fatal
>> > > > > >> exceptions are `retriable` in general for producer"
>> > > > > >>
>> > > > > >>
>> > > > > >> In the list of exceptions in your KIP, I found non-fatal
>> > exceptions
>> > > > that
>> > > > > >> do not inherit from RetriableException. I guess those are the
>> ones
>> > > you
>> > > > > >> are referring to in your statement:
>> > > > > >>
>> > > > > >> InvalidProducerEpochException
>> > > > > >> InvalidPidMappingException
>> > > > > >> TransactionAbortedException
>> > > > > >>
>> > > > > >> All of those exceptions are non-fatal and do not inherit from
>> > > > > >> RetriableException. Is there a reason for that? If they
>> depended
>> > > from
>> > > > > >> RetriableException we would be a bit closer to a super
>> exception I
>> > > > > >> mention above.
>> > > > > >>
>> > > > > >> The reason is that sender may catch those exceptions in the
>> > > > > > ProduceResponse, and it currently does infinite
>> > > > > > retries on RetriableException. To make sure we could trigger the
>> > > > > > abortTransaction() by catching non-fatal thrown
>> > > > > > exceptions and reinitialize the txn state, we chose not to let
>> > those
>> > > > > > exceptions inherit RetriableException so that
>> > > > > > they won't cause infinite retry on sender.
>> > > > > >
>> > > > > >
>> > > > > >> With OutOfOrderSequenceException and
>> UnknownProducerIdException, I
>> > > > think
>> > > > > >> to understand that their fatality depends on the type (i.e.
>> > > > > >> configuration) of the producer. That makes it difficult to
>> have a
>> > > > super
>> > > > > >> exception that encodes the retriability as mentioned above.
>> Would
>> > it
>> > > > be
>> > > > > >> possible to introduce exceptions that inherit from
>> > > RetriableException
>> > > > > >> and that are thrown when those exceptions are caught from the
>> > > brokers
>> > > > > >> and the type of the producer is such that the exceptions are
>> > > > retriable?
>> > > > > >>
>> > > > > >> Yea, I think in general the exception type mixing is difficult
>> to
>> > > get
>> > > > > > them all right. I have already proposed another solution based
>> on
>> > my
>> > > > > > offline discussion with some folks working on EOS
>> > > > > > to make the handling more straightforward for end users without
>> the
>> > > > need
>> > > > > > to distinguish exception fatality.
>> > > > > >
>> > > > > >> Best,
>> > > > > >> Bruno
>> > > > > >>
>> > > > > >>
>> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
>> > > > > >> > Thanks Boyang for the proposal! I made a pass over the list
>> and
>> > > here
>> > > > > are
>> > > > > >> > some thoughts:
>> > > > > >> >
>> > > > > >> > 0) Although this is not part of the public API, I think we
>> > should
>> > > > make
>> > > > > >> sure
>> > > > > >> > that the suggested pattern (i.e. user can always call
>> abortTxn()
>> > > > when
>> > > > > >> > handling non-fatal errors) are indeed supported. E.g. if the
>> txn
>> > > is
>> > > > > >> already
>> > > > > >> > aborted by the broker side, then users can still call
>> abortTxn
>> > > which
>> > > > > >> would
>> > > > > >> > not throw another exception but just be treated as a no-op.
>> > > > > >> >
>> > > > > >> > 1) *ConcurrentTransactionsException*: I think this error can
>> > also
>> > > be
>> > > > > >> > returned but not documented yet. This should be a non-fatal
>> > error.
>> > > > > >> >
>> > > > > >> > 2) *InvalidTxnStateException*: this error is returned from
>> > broker
>> > > > when
>> > > > > >> txn
>> > > > > >> > state transition failed (e.g. it is trying to transit to
>> > > > > complete-commit
>> > > > > >> > while the current state is not prepare-commit). This error
>> could
>> > > > > >> indicates
>> > > > > >> > a bug on the client internal code or the broker code, OR a
>> user
>> > > > error
>> > > > > >> --- a
>> > > > > >> > similar error is ConcurrentTransactionsException, i.e. if
>> Kafka
>> > is
>> > > > > >> bug-free
>> > > > > >> > these exceptions would only be returned if users try to do
>> > > something
>> > > > > >> wrong,
>> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
>> > > thinking
>> > > > it
>> > > > > >> > should be a non-fatal error instead of a fatal error, wdyt?
>> > > > > >> >
>> > > > > >> > 3) *KafkaException*: case i "indicates fatal transactional
>> > > sequence
>> > > > > >> > (Fatal)", this is a bit conflicting with the
>> > > > *OutOfSequenceException*
>> > > > > >> that
>> > > > > >> > is treated as non-fatal. I guess your proposal is that
>> > > > > >> > OutOfOrderSequenceException would be treated either as fatal
>> > with
>> > > > > >> > transactional producer, or non-fatal with idempotent
>> producer,
>> > is
>> > > > that
>> > > > > >> > right? If the producer is only configured with idempotency
>> but
>> > not
>> > > > > >> > transaction, then throwing a
>> TransactionStateCorruptedException
>> > > for
>> > > > > >> > non-fatal errors would be confusing since users are not using
>> > > > > >> transactions
>> > > > > >> > at all.. So I suggest we always throw OutOfSequenceException
>> > as-is
>> > > > > (i.e.
>> > > > > >> > not wrapped) no matter how the producer is configured, and
>> let
>> > the
>> > > > > >> caller
>> > > > > >> > decide how to handle it based on whether it is only
>> idempotent
>> > or
>> > > > > >> > transactional itself.
>> > > > > >> >
>> > > > > >> > 4) Besides all the txn APIs, the `send()` callback / future
>> can
>> > > also
>> > > > > >> throw
>> > > > > >> > txn-related exceptions, I think this KIP should also cover
>> this
>> > > API
>> > > > as
>> > > > > >> well?
>> > > > > >> >
>> > > > > >> > 5) This is related to 1/2) above: sometimes those non-fatal
>> > errors
>> > > > > like
>> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the state
>> being
>> > > > > >> corrupted
>> > > > > >> > at the broker side, but maybe users are doing something
>> wrong.
>> > So
>> > > > I'm
>> > > > > >> > wondering if we should further distinguish those non-fatal
>> > errors
>> > > > > >> between
>> > > > > >> > a) those that are caused by Kafka itself, e.g. a broker timed
>> > out
>> > > > and
>> > > > > >> > aborted a txn and later an endTxn request is received, and b)
>> > the
>> > > > > user's
>> > > > > >> > API call pattern is incorrect, causing the request to be
>> > rejected
>> > > > with
>> > > > > >> an
>> > > > > >> > error code from the broker.
>> *TransactionStateCorruptedException*
>> > > > feels
>> > > > > >> to
>> > > > > >> > me more like for case a), but not case b).
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > Guozhang
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
>> > > > > reluctanthero104@gmail.com>
>> > > > > >> > wrote:
>> > > > > >> >
>> > > > > >> >> Thanks Matthias, I think your proposal makes sense as well,
>> on
>> > > the
>> > > > > pro
>> > > > > >> side
>> > > > > >> >> we could have a universally agreed exception type to be
>> caught
>> > by
>> > > > the
>> > > > > >> user,
>> > > > > >> >> without having an extra layer on top of the actual
>> exceptions.
>> > I
>> > > > > could
>> > > > > >> see
>> > > > > >> >> some issue on downsides:
>> > > > > >> >>
>> > > > > >> >> 1. The exception hierarchy will be more complex. Right now
>> we
>> > > have
>> > > > > >> >> RetriableException type, if we are going to add a
>> > > > > >> >> `ProducerRetriableException` type, we have to put this new
>> > > > interface
>> > > > > >> as the
>> > > > > >> >> parent of the RetriableException, because not all thrown
>> > > non-fatal
>> > > > > >> >> exceptions are `retriable` in general for producer, for
>> example
>> > > > > >> >> <
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
>> > > > > >> >>> .
>> > > > > >> >> We could have an empty interface
>> `ProducerRetriableException`
>> > to
>> > > > let
>> > > > > >> all
>> > > > > >> >> the thrown exceptions implement for sure, but it's a bit
>> > > unorthodox
>> > > > > in
>> > > > > >> the
>> > > > > >> >> way we deal with exceptions in general.
>> > > > > >> >>
>> > > > > >> >> 2. There are cases where we throw a KafkaException wrapping
>> > > another
>> > > > > >> >> KafkaException as either fatal or non-fatal. If we use an
>> > > interface
>> > > > > to
>> > > > > >> >> solve #1, it is also required to implement another bloated
>> > > > exception
>> > > > > >> class
>> > > > > >> >> which could replace KafkaException type, as we couldn't mark
>> > > > > >> KafkaException
>> > > > > >> >> as retriable for sure.
>> > > > > >> >>
>> > > > > >> >> 3. In terms of the encapsulation, wrapping means we could
>> limit
>> > > the
>> > > > > >> scope
>> > > > > >> >> of affection to the producer only, which is important since
>> we
>> > > > don't
>> > > > > >> want
>> > > > > >> >> shared exception types to implement a producer-related
>> > interface,
>> > > > > such
>> > > > > >> >> as UnknownTopicOrPartitionException.
>> > > > > >> >>
>> > > > > >> >> Best,
>> > > > > >> >> Boyang
>> > > > > >> >>
>> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
>> > mjsax@apache.org
>> > > >
>> > > > > >> wrote:
>> > > > > >> >>
>> > > > > >> >>> Thanks for the KIP Boyang!
>> > > > > >> >>>
>> > > > > >> >>> Overall, categorizing exceptions makes a lot of sense. As I
>> > > don't
>> > > > > know
>> > > > > >> >>> the producer internals well enough, I cannot comment on the
>> > > > > >> >>> categorization in detail though.
>> > > > > >> >>>
>> > > > > >> >>> What I am wondering is, if we should introduce an exception
>> > > > > interface
>> > > > > >> >>> that non-fatal exception would implement instead of
>> creating a
>> > > new
>> > > > > >> class
>> > > > > >> >>> that will wrap non-fatal exceptions? What would be the
>> > pros/cons
>> > > > for
>> > > > > >> >>> both designs?
>> > > > > >> >>>
>> > > > > >> >>>
>> > > > > >> >>> -Matthias
>> > > > > >> >>>
>> > > > > >> >>>
>> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
>> > > > > >> >>>> Hey there,
>> > > > > >> >>>>
>> > > > > >> >>>> I would like to start a discussion thread for KIP-691:
>> > > > > >> >>>>
>> > > > > >> >>>
>> > > > > >> >>
>> > > > > >>
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
>> > > > > >> >>>>
>> > > > > >> >>>> The KIP is aiming to simplify the exception handling logic
>> > for
>> > > > > >> >>>> transactional Producer users by classifying fatal and
>> > non-fatal
>> > > > > >> >>> exceptions
>> > > > > >> >>>> and throw them correspondingly for easier catch and retry.
>> > Let
>> > > me
>> > > > > >> know
>> > > > > >> >>> what
>> > > > > >> >>>> you think.
>> > > > > >> >>>>
>> > > > > >> >>>> Best,
>> > > > > >> >>>> Boyang
>> > > > > >> >>>>
>> > > > > >> >>>
>> > > > > >> >>
>> > > > > >> >
>> > > > > >> >
>> > > > > >>
>> > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > -- Guozhang
>> > > >
>> > >
>> >
>> >
>> > --
>> > -- Guozhang
>> >
>>
>
>
> --
> -- Guozhang
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Boyang/Jason,

I've also thought about this (i.e. using CommitFailed for all non-fatal),
but what I'm pondering is that, in the catch (CommitFailed) block, what
would happen if the `producer.abortTransaction();` throws again? should
that be captured as a fatal and cause the client to close again.

If yes, then naively the pattern would be:

...
catch (CommitFailedException e) {
        // Transaction commit failed with abortable error, user could reset
        // the application state and resume with a new transaction. The root
        // cause was wrapped in the thrown exception.
        resetToLastCommittedPositions(consumer);
        try {
            producer.abortTransaction();
        } catch (KafkaException e) {
            producer.close();
            consumer.close();
            throw e;
        }
    } catch (KafkaException e) {
        producer.close();
        consumer.close();
        throw e;
    }
...

Guozhang

On Fri, Jan 22, 2021 at 10:47 AM Boyang Chen <re...@gmail.com>
wrote:

> Hey Guozhang,
>
> Jason and I were discussing the new API offline and decided to take another
> approach. Firstly, the reason not to invent a new API with returned boolean
> flag is for compatibility consideration, since old EOS code would not know
> that a given transaction commit was failed internally as they don't listen
> to the output flag. Our proposed alternative solution is to let
> *commitTransaction
> throw CommitFailedException whenever the commit failed with non-fatal
> exception*, so that on the caller side the handling logic becomes:
>
> try {
>         if (shouldCommit) {
>             producer.commitTransaction();
>         } else {
>             resetToLastCommittedPositions(consumer);
>             producer.abortTransaction();
>         }
>     } catch (CommitFailedException e) {
>         // Transaction commit failed with abortable error, user could reset
>         // the application state and resume with a new transaction. The
> root
>         // cause was wrapped in the thrown exception.
>         resetToLastCommittedPositions(consumer);
>         producer.abortTransaction();
>     } catch (KafkaException e) {
>         producer.close();
>         consumer.close();
>         throw e;
>     }
>
> This approach looks cleaner as all exception types other than CommitFailed
> will doom to be fatal, which is very easy to adopt for users. In the
> meantime, we still maintain the commitTxn behavior to throw instead of
> silently failing.
>
> In addition, we decided to drop the recommendation to handle
> TimeoutException and leave it to the users to make the call. The downside
> for blindly calling abortTxn upon timeout is that we could result in an
> illegal state when the commit was already successful on the broker
> side. Without a good guarantee on the outcome, overcomplicating the
> template should not be encouraged IMHO.
>
> Let me know your thoughts on the new approach here, thank you!
>
> Best,
> Boyang
>
> On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Thanks for your clarification on 2)/3), that makes sense.
> >
> > On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <reluctanthero104@gmail.com
> >
> > wrote:
> >
> > > Thanks for the input Guozhang, replied inline.
> > >
> > > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hello Boyang,
> > > >
> > > > Thanks for the updated KIP. I read it again and have the following
> > > > thoughts:
> > > >
> > > > 0. I'm a bit concerned that if commitTxn does not throw any non-fatal
> > > > exception, and instead we rely on the subsequent beginTxn call to
> > throw,
> > > it
> > > > may violate some callers with a pattern that relying on commitTxn to
> > > > succeed to make some non-rollback operations. For example:
> > > >
> > > > beginTxn()
> > > > // do some read-write on my local DB
> > > > commitTxn()
> > > > // if commitTxn succeeds, then commit the DB
> > > >
> > > > -------------
> > > >
> > > > The issue is that, committing DB is a non-rollback operation, and
> users
> > > may
> > > > just rely on commitTxn to return without error to make this
> > non-rollback
> > > > call. Of course we can just claim this pattern is not legitimate and
> is
> > > not
> > > > the right way of doing things, but many users may naturally adopt
> this
> > > > pattern.
> > > >
> > > > So maybe we should still let commitTxn also throw non-fatal
> exceptions,
> > > in
> > > > which case we would then call abortTxn again.
> > > >
> > > > But if we do this, the pattern becomes:
> > > >
> > > > try {
> > > >    beginTxn()
> > > >    // do something
> > > > } catch (Exception) {
> > > >    shouldCommit = false;
> > > > }
> > > >
> > > > if (shouldCommit) {
> > > >     try {
> > > >         commitTxn()
> > > >     } catch (...) {        // enumerate all fatal exceptions
> > > >         shutdown()
> > > >     } catch (KafkaException) {
> > > >         // non-fatal
> > > >         shouldCommit = false;
> > > >     }
> > > > }
> > > >
> > > > if (!shouldCommit && running()) {
> > > > try {
> > > >         abortTxn()
> > > >     } catch (KafkaException) {
> > > >         // only throw fatal
> > > >         shutdown()
> > > >     }
> > > > }
> > > >
> > > > ---------------------
> > > >
> > > > Which is much more complicated.
> > > >
> > > > Thank makes me think, the alternative we have discussed offline may
> be
> > > > better: let commitTxn() to return a boolean flag.
> > > >
> > > > * If it returns true, it means the commit succeeded. Users can
> > > comfortably
> > > > continue and do any external non-rollback operations if they like.
> > > > * If it returns false, it means the commit failed with non-fatal
> error
> > > *AND
> > > > the txn has been aborted*. Users do not need to call abortTxn again.
> > > > * If it throws, then it means fatal errors. Users should shut down
> the
> > > > client.
> > > >
> > > > In this case, the pattern becomes:
> > > >
> > > > try {
> > > >    beginTxn()
> > > >    // do something
> > > > } catch (Exception) {
> > > >    shouldCommit = false;
> > > > }
> > > >
> > > > try {
> > > >     if (shouldCommit) {
> > > >         commitSucceeded = commitTxn()
> > > >     } else {
> > > >         // reset offsets, rollback operations, etc
> > > >         abortTxn()
> > > >     }
> > > > } catch (KafkaException) {
> > > >     // only throw fatal
> > > >     shutdown()
> > > > }
> > > >
> > > > if (commitSucceeded)
> > > >    // do other non-rollbackable things
> > > > else
> > > >    // reset offsets, rollback operations, etc
> > > >
> > > > -------------------------
> > > >
> > > > Of course, if we want to have better visibility into what caused the
> > > commit
> > > > to fail and txn to abort. We can let the return type be an enum
> instead
> > > of
> > > > a flag. But my main idea is to still let the commitTxn be the final
> > point
> > > > users can tell whether this txn succeeded or not, instead of relying
> on
> > > the
> > > > next beginTxn() call.
> > > >
> > > > I agree that adding a boolean flag is indeed useful in this case.
> Will
> > > update the KIP.
> > >
> > > 1. Re: "while maintaining the behavior to throw fatal exception in raw"
> > not
> > > > sure what do you mean by "throw" here. Are you proposing the callback
> > > would
> > > > *pass* (not throw) in any fatal exceptions when triggered without
> > > wrapping?
> > > >
> > > > Yes, I want to say *pass*, the benefit is to make the end user's
> > > expectation consistent
> > > regarding exception handling.
> > >
> > >
> > > > 2. I'm not sure I understand the claim regarding the callback that
> "In
> > > EOS
> > > > setup, it is not required to handle the exception". Are you proposing
> > > that,
> > > > e.g. in Streams, we do not try to handle any exceptions if EOS is
> > enabled
> > > > in the callback anymore, but just let commitTxn() itself to fail to
> be
> > > > notified about the problem? Personally I think in Streams we should
> > just
> > > > make the handling logic of the callback to be consistent regardless
> of
> > > the
> > > > EOS settings (today we have different logic depending on this logic,
> > > which
> > > > I think could be unified as well).
> > > >
> > > > My idea originates from the claim on send API:
> > > "When used as part of a transaction, it is not necessary to define a
> > > callback or check the result of the future  in order to detect errors
> > from
> > > <code>send</code>. "
> > > My understanding is that for EOS, the exception will be detected by
> > calling
> > > transactional APIs either way, so it is a duplicate handling to track
> > > all the sendExceptions in RecordCollector. However, I looked up
> > > sendException is being used today as follow:
> > >
> > > 1. Pass to "ProductionExceptionHandler" for any default or customized
> > > exception handler to handle
> > > 2. Stop collecting offset info or new exceptions
> > > 3. Check and rethrow exceptions in close(), flush() or new send() calls
> > >
> > > To my understanding, we should still honor the commitment to #1 for any
> > > user customized implementation. The #2 does not really affect our
> > decision
> > > upon EOS. The #3 here is still valuable as it could help us fail fast
> in
> > > new send() instead of waiting to later stage of processing. In that
> > sense,
> > > I agree we should continue to record send exceptions even under EOS
> case
> > to
> > > ensure the strength of stream side Producer logic. On the safer side,
> we
> > no
> > > longer need to wrap certain fatal exceptions like ProducerFenced as
> > > TaskMigrated, since they should not crash the stream thread if thrown
> in
> > > raw format, once we adopt the new processing model in the send phase.
> > >
> > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > >
> > > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks for everyone's feedback so far. I have polished the KIP
> after
> > > > > offline discussion with some folks working on EOS to make the
> > exception
> > > > > handling more lightweight. The essential change is that we are not
> > > > > inventing a new intermediate exception type, but instead
> separating a
> > > > full
> > > > > transaction into two phases:
> > > > >
> > > > > 1. The data transmission phase
> > > > > 2. The commit phase
> > > > >
> > > > > This way for any exception thrown from phase 1, will be an
> indicator
> > in
> > > > > phase 2 whether we should do commit or abort, and from now on
> > > > > `commitTransaction` should only throw fatal exceptions, similar to
> > > > > `abortTransaction`, so that any KafkaException caught in the commit
> > > phase
> > > > > will be definitely fatal to crash the app. For more advanced users
> > such
> > > > as
> > > > > Streams, we have the ability to further wrap selected types of
> fatal
> > > > > exceptions to trigger task migration if necessary.
> > > > >
> > > > > More details in the KIP, feel free to take another look, thanks!
> > > > >
> > > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks Bruno for the feedback.
> > > > > >
> > > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <bruno@confluent.io
> >
> > > > wrote:
> > > > > >
> > > > > >> Thanks Boyang for the KIP!
> > > > > >>
> > > > > >> Like Matthias, I do also not know the producer internal well
> > enough
> > > to
> > > > > >> comment on the categorization. However, I think having a super
> > > > exception
> > > > > >> (e.g. RetriableException) that  encodes if an exception is fatal
> > or
> > > > not
> > > > > >> is cleaner, better understandable and less error-prone, because
> > > > ideally
> > > > > >> when you add a new non-fatal exception in future you just need
> to
> > > > think
> > > > > >> about letting it inherit from the super exception and all the
> rest
> > > of
> > > > > >> the code will just behave correctly without the need to wrap the
> > new
> > > > > >> exception into another exception each time it is thrown (maybe
> it
> > is
> > > > > >> thrown at different location in the code).
> > > > > >>
> > > > > >> As far as I understand the following statement from your
> previous
> > > > e-mail
> > > > > >> is the reason that currently such a super exception is not
> > possible:
> > > > > >>
> > > > > >> "Right now we have RetriableException type, if we are going to
> > add a
> > > > > >> `ProducerRetriableException` type, we have to put this new
> > interface
> > > > as
> > > > > >> the parent of the RetriableException, because not all thrown
> > > non-fatal
> > > > > >> exceptions are `retriable` in general for producer"
> > > > > >>
> > > > > >>
> > > > > >> In the list of exceptions in your KIP, I found non-fatal
> > exceptions
> > > > that
> > > > > >> do not inherit from RetriableException. I guess those are the
> ones
> > > you
> > > > > >> are referring to in your statement:
> > > > > >>
> > > > > >> InvalidProducerEpochException
> > > > > >> InvalidPidMappingException
> > > > > >> TransactionAbortedException
> > > > > >>
> > > > > >> All of those exceptions are non-fatal and do not inherit from
> > > > > >> RetriableException. Is there a reason for that? If they depended
> > > from
> > > > > >> RetriableException we would be a bit closer to a super
> exception I
> > > > > >> mention above.
> > > > > >>
> > > > > >> The reason is that sender may catch those exceptions in the
> > > > > > ProduceResponse, and it currently does infinite
> > > > > > retries on RetriableException. To make sure we could trigger the
> > > > > > abortTransaction() by catching non-fatal thrown
> > > > > > exceptions and reinitialize the txn state, we chose not to let
> > those
> > > > > > exceptions inherit RetriableException so that
> > > > > > they won't cause infinite retry on sender.
> > > > > >
> > > > > >
> > > > > >> With OutOfOrderSequenceException and
> UnknownProducerIdException, I
> > > > think
> > > > > >> to understand that their fatality depends on the type (i.e.
> > > > > >> configuration) of the producer. That makes it difficult to have
> a
> > > > super
> > > > > >> exception that encodes the retriability as mentioned above.
> Would
> > it
> > > > be
> > > > > >> possible to introduce exceptions that inherit from
> > > RetriableException
> > > > > >> and that are thrown when those exceptions are caught from the
> > > brokers
> > > > > >> and the type of the producer is such that the exceptions are
> > > > retriable?
> > > > > >>
> > > > > >> Yea, I think in general the exception type mixing is difficult
> to
> > > get
> > > > > > them all right. I have already proposed another solution based on
> > my
> > > > > > offline discussion with some folks working on EOS
> > > > > > to make the handling more straightforward for end users without
> the
> > > > need
> > > > > > to distinguish exception fatality.
> > > > > >
> > > > > >> Best,
> > > > > >> Bruno
> > > > > >>
> > > > > >>
> > > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > > >> > Thanks Boyang for the proposal! I made a pass over the list
> and
> > > here
> > > > > are
> > > > > >> > some thoughts:
> > > > > >> >
> > > > > >> > 0) Although this is not part of the public API, I think we
> > should
> > > > make
> > > > > >> sure
> > > > > >> > that the suggested pattern (i.e. user can always call
> abortTxn()
> > > > when
> > > > > >> > handling non-fatal errors) are indeed supported. E.g. if the
> txn
> > > is
> > > > > >> already
> > > > > >> > aborted by the broker side, then users can still call abortTxn
> > > which
> > > > > >> would
> > > > > >> > not throw another exception but just be treated as a no-op.
> > > > > >> >
> > > > > >> > 1) *ConcurrentTransactionsException*: I think this error can
> > also
> > > be
> > > > > >> > returned but not documented yet. This should be a non-fatal
> > error.
> > > > > >> >
> > > > > >> > 2) *InvalidTxnStateException*: this error is returned from
> > broker
> > > > when
> > > > > >> txn
> > > > > >> > state transition failed (e.g. it is trying to transit to
> > > > > complete-commit
> > > > > >> > while the current state is not prepare-commit). This error
> could
> > > > > >> indicates
> > > > > >> > a bug on the client internal code or the broker code, OR a
> user
> > > > error
> > > > > >> --- a
> > > > > >> > similar error is ConcurrentTransactionsException, i.e. if
> Kafka
> > is
> > > > > >> bug-free
> > > > > >> > these exceptions would only be returned if users try to do
> > > something
> > > > > >> wrong,
> > > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
> > > thinking
> > > > it
> > > > > >> > should be a non-fatal error instead of a fatal error, wdyt?
> > > > > >> >
> > > > > >> > 3) *KafkaException*: case i "indicates fatal transactional
> > > sequence
> > > > > >> > (Fatal)", this is a bit conflicting with the
> > > > *OutOfSequenceException*
> > > > > >> that
> > > > > >> > is treated as non-fatal. I guess your proposal is that
> > > > > >> > OutOfOrderSequenceException would be treated either as fatal
> > with
> > > > > >> > transactional producer, or non-fatal with idempotent producer,
> > is
> > > > that
> > > > > >> > right? If the producer is only configured with idempotency but
> > not
> > > > > >> > transaction, then throwing a
> TransactionStateCorruptedException
> > > for
> > > > > >> > non-fatal errors would be confusing since users are not using
> > > > > >> transactions
> > > > > >> > at all.. So I suggest we always throw OutOfSequenceException
> > as-is
> > > > > (i.e.
> > > > > >> > not wrapped) no matter how the producer is configured, and let
> > the
> > > > > >> caller
> > > > > >> > decide how to handle it based on whether it is only idempotent
> > or
> > > > > >> > transactional itself.
> > > > > >> >
> > > > > >> > 4) Besides all the txn APIs, the `send()` callback / future
> can
> > > also
> > > > > >> throw
> > > > > >> > txn-related exceptions, I think this KIP should also cover
> this
> > > API
> > > > as
> > > > > >> well?
> > > > > >> >
> > > > > >> > 5) This is related to 1/2) above: sometimes those non-fatal
> > errors
> > > > > like
> > > > > >> > ConcurrentTxn or InvalidTxnState are not due to the state
> being
> > > > > >> corrupted
> > > > > >> > at the broker side, but maybe users are doing something wrong.
> > So
> > > > I'm
> > > > > >> > wondering if we should further distinguish those non-fatal
> > errors
> > > > > >> between
> > > > > >> > a) those that are caused by Kafka itself, e.g. a broker timed
> > out
> > > > and
> > > > > >> > aborted a txn and later an endTxn request is received, and b)
> > the
> > > > > user's
> > > > > >> > API call pattern is incorrect, causing the request to be
> > rejected
> > > > with
> > > > > >> an
> > > > > >> > error code from the broker.
> *TransactionStateCorruptedException*
> > > > feels
> > > > > >> to
> > > > > >> > me more like for case a), but not case b).
> > > > > >> >
> > > > > >> >
> > > > > >> > Guozhang
> > > > > >> >
> > > > > >> >
> > > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > > reluctanthero104@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> >> Thanks Matthias, I think your proposal makes sense as well,
> on
> > > the
> > > > > pro
> > > > > >> side
> > > > > >> >> we could have a universally agreed exception type to be
> caught
> > by
> > > > the
> > > > > >> user,
> > > > > >> >> without having an extra layer on top of the actual
> exceptions.
> > I
> > > > > could
> > > > > >> see
> > > > > >> >> some issue on downsides:
> > > > > >> >>
> > > > > >> >> 1. The exception hierarchy will be more complex. Right now we
> > > have
> > > > > >> >> RetriableException type, if we are going to add a
> > > > > >> >> `ProducerRetriableException` type, we have to put this new
> > > > interface
> > > > > >> as the
> > > > > >> >> parent of the RetriableException, because not all thrown
> > > non-fatal
> > > > > >> >> exceptions are `retriable` in general for producer, for
> example
> > > > > >> >> <
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > > >> >>> .
> > > > > >> >> We could have an empty interface `ProducerRetriableException`
> > to
> > > > let
> > > > > >> all
> > > > > >> >> the thrown exceptions implement for sure, but it's a bit
> > > unorthodox
> > > > > in
> > > > > >> the
> > > > > >> >> way we deal with exceptions in general.
> > > > > >> >>
> > > > > >> >> 2. There are cases where we throw a KafkaException wrapping
> > > another
> > > > > >> >> KafkaException as either fatal or non-fatal. If we use an
> > > interface
> > > > > to
> > > > > >> >> solve #1, it is also required to implement another bloated
> > > > exception
> > > > > >> class
> > > > > >> >> which could replace KafkaException type, as we couldn't mark
> > > > > >> KafkaException
> > > > > >> >> as retriable for sure.
> > > > > >> >>
> > > > > >> >> 3. In terms of the encapsulation, wrapping means we could
> limit
> > > the
> > > > > >> scope
> > > > > >> >> of affection to the producer only, which is important since
> we
> > > > don't
> > > > > >> want
> > > > > >> >> shared exception types to implement a producer-related
> > interface,
> > > > > such
> > > > > >> >> as UnknownTopicOrPartitionException.
> > > > > >> >>
> > > > > >> >> Best,
> > > > > >> >> Boyang
> > > > > >> >>
> > > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> > mjsax@apache.org
> > > >
> > > > > >> wrote:
> > > > > >> >>
> > > > > >> >>> Thanks for the KIP Boyang!
> > > > > >> >>>
> > > > > >> >>> Overall, categorizing exceptions makes a lot of sense. As I
> > > don't
> > > > > know
> > > > > >> >>> the producer internals well enough, I cannot comment on the
> > > > > >> >>> categorization in detail though.
> > > > > >> >>>
> > > > > >> >>> What I am wondering is, if we should introduce an exception
> > > > > interface
> > > > > >> >>> that non-fatal exception would implement instead of
> creating a
> > > new
> > > > > >> class
> > > > > >> >>> that will wrap non-fatal exceptions? What would be the
> > pros/cons
> > > > for
> > > > > >> >>> both designs?
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> -Matthias
> > > > > >> >>>
> > > > > >> >>>
> > > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > > >> >>>> Hey there,
> > > > > >> >>>>
> > > > > >> >>>> I would like to start a discussion thread for KIP-691:
> > > > > >> >>>>
> > > > > >> >>>
> > > > > >> >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > > >> >>>>
> > > > > >> >>>> The KIP is aiming to simplify the exception handling logic
> > for
> > > > > >> >>>> transactional Producer users by classifying fatal and
> > non-fatal
> > > > > >> >>> exceptions
> > > > > >> >>>> and throw them correspondingly for easier catch and retry.
> > Let
> > > me
> > > > > >> know
> > > > > >> >>> what
> > > > > >> >>>> you think.
> > > > > >> >>>>
> > > > > >> >>>> Best,
> > > > > >> >>>> Boyang
> > > > > >> >>>>
> > > > > >> >>>
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Hey Guozhang,

Jason and I were discussing the new API offline and decided to take another
approach. Firstly, the reason not to invent a new API with returned boolean
flag is for compatibility consideration, since old EOS code would not know
that a given transaction commit was failed internally as they don't listen
to the output flag. Our proposed alternative solution is to let
*commitTransaction
throw CommitFailedException whenever the commit failed with non-fatal
exception*, so that on the caller side the handling logic becomes:

try {
        if (shouldCommit) {
            producer.commitTransaction();
        } else {
            resetToLastCommittedPositions(consumer);
            producer.abortTransaction();
        }
    } catch (CommitFailedException e) {
        // Transaction commit failed with abortable error, user could reset
        // the application state and resume with a new transaction. The root
        // cause was wrapped in the thrown exception.
        resetToLastCommittedPositions(consumer);
        producer.abortTransaction();
    } catch (KafkaException e) {
        producer.close();
        consumer.close();
        throw e;
    }

This approach looks cleaner as all exception types other than CommitFailed
will doom to be fatal, which is very easy to adopt for users. In the
meantime, we still maintain the commitTxn behavior to throw instead of
silently failing.

In addition, we decided to drop the recommendation to handle
TimeoutException and leave it to the users to make the call. The downside
for blindly calling abortTxn upon timeout is that we could result in an
illegal state when the commit was already successful on the broker
side. Without a good guarantee on the outcome, overcomplicating the
template should not be encouraged IMHO.

Let me know your thoughts on the new approach here, thank you!

Best,
Boyang

On Tue, Jan 19, 2021 at 11:11 AM Guozhang Wang <wa...@gmail.com> wrote:

> Thanks for your clarification on 2)/3), that makes sense.
>
> On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Thanks for the input Guozhang, replied inline.
> >
> > On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Hello Boyang,
> > >
> > > Thanks for the updated KIP. I read it again and have the following
> > > thoughts:
> > >
> > > 0. I'm a bit concerned that if commitTxn does not throw any non-fatal
> > > exception, and instead we rely on the subsequent beginTxn call to
> throw,
> > it
> > > may violate some callers with a pattern that relying on commitTxn to
> > > succeed to make some non-rollback operations. For example:
> > >
> > > beginTxn()
> > > // do some read-write on my local DB
> > > commitTxn()
> > > // if commitTxn succeeds, then commit the DB
> > >
> > > -------------
> > >
> > > The issue is that, committing DB is a non-rollback operation, and users
> > may
> > > just rely on commitTxn to return without error to make this
> non-rollback
> > > call. Of course we can just claim this pattern is not legitimate and is
> > not
> > > the right way of doing things, but many users may naturally adopt this
> > > pattern.
> > >
> > > So maybe we should still let commitTxn also throw non-fatal exceptions,
> > in
> > > which case we would then call abortTxn again.
> > >
> > > But if we do this, the pattern becomes:
> > >
> > > try {
> > >    beginTxn()
> > >    // do something
> > > } catch (Exception) {
> > >    shouldCommit = false;
> > > }
> > >
> > > if (shouldCommit) {
> > >     try {
> > >         commitTxn()
> > >     } catch (...) {        // enumerate all fatal exceptions
> > >         shutdown()
> > >     } catch (KafkaException) {
> > >         // non-fatal
> > >         shouldCommit = false;
> > >     }
> > > }
> > >
> > > if (!shouldCommit && running()) {
> > > try {
> > >         abortTxn()
> > >     } catch (KafkaException) {
> > >         // only throw fatal
> > >         shutdown()
> > >     }
> > > }
> > >
> > > ---------------------
> > >
> > > Which is much more complicated.
> > >
> > > Thank makes me think, the alternative we have discussed offline may be
> > > better: let commitTxn() to return a boolean flag.
> > >
> > > * If it returns true, it means the commit succeeded. Users can
> > comfortably
> > > continue and do any external non-rollback operations if they like.
> > > * If it returns false, it means the commit failed with non-fatal error
> > *AND
> > > the txn has been aborted*. Users do not need to call abortTxn again.
> > > * If it throws, then it means fatal errors. Users should shut down the
> > > client.
> > >
> > > In this case, the pattern becomes:
> > >
> > > try {
> > >    beginTxn()
> > >    // do something
> > > } catch (Exception) {
> > >    shouldCommit = false;
> > > }
> > >
> > > try {
> > >     if (shouldCommit) {
> > >         commitSucceeded = commitTxn()
> > >     } else {
> > >         // reset offsets, rollback operations, etc
> > >         abortTxn()
> > >     }
> > > } catch (KafkaException) {
> > >     // only throw fatal
> > >     shutdown()
> > > }
> > >
> > > if (commitSucceeded)
> > >    // do other non-rollbackable things
> > > else
> > >    // reset offsets, rollback operations, etc
> > >
> > > -------------------------
> > >
> > > Of course, if we want to have better visibility into what caused the
> > commit
> > > to fail and txn to abort. We can let the return type be an enum instead
> > of
> > > a flag. But my main idea is to still let the commitTxn be the final
> point
> > > users can tell whether this txn succeeded or not, instead of relying on
> > the
> > > next beginTxn() call.
> > >
> > > I agree that adding a boolean flag is indeed useful in this case. Will
> > update the KIP.
> >
> > 1. Re: "while maintaining the behavior to throw fatal exception in raw"
> not
> > > sure what do you mean by "throw" here. Are you proposing the callback
> > would
> > > *pass* (not throw) in any fatal exceptions when triggered without
> > wrapping?
> > >
> > > Yes, I want to say *pass*, the benefit is to make the end user's
> > expectation consistent
> > regarding exception handling.
> >
> >
> > > 2. I'm not sure I understand the claim regarding the callback that "In
> > EOS
> > > setup, it is not required to handle the exception". Are you proposing
> > that,
> > > e.g. in Streams, we do not try to handle any exceptions if EOS is
> enabled
> > > in the callback anymore, but just let commitTxn() itself to fail to be
> > > notified about the problem? Personally I think in Streams we should
> just
> > > make the handling logic of the callback to be consistent regardless of
> > the
> > > EOS settings (today we have different logic depending on this logic,
> > which
> > > I think could be unified as well).
> > >
> > > My idea originates from the claim on send API:
> > "When used as part of a transaction, it is not necessary to define a
> > callback or check the result of the future  in order to detect errors
> from
> > <code>send</code>. "
> > My understanding is that for EOS, the exception will be detected by
> calling
> > transactional APIs either way, so it is a duplicate handling to track
> > all the sendExceptions in RecordCollector. However, I looked up
> > sendException is being used today as follow:
> >
> > 1. Pass to "ProductionExceptionHandler" for any default or customized
> > exception handler to handle
> > 2. Stop collecting offset info or new exceptions
> > 3. Check and rethrow exceptions in close(), flush() or new send() calls
> >
> > To my understanding, we should still honor the commitment to #1 for any
> > user customized implementation. The #2 does not really affect our
> decision
> > upon EOS. The #3 here is still valuable as it could help us fail fast in
> > new send() instead of waiting to later stage of processing. In that
> sense,
> > I agree we should continue to record send exceptions even under EOS case
> to
> > ensure the strength of stream side Producer logic. On the safer side, we
> no
> > longer need to wrap certain fatal exceptions like ProducerFenced as
> > TaskMigrated, since they should not crash the stream thread if thrown in
> > raw format, once we adopt the new processing model in the send phase.
> >
> >
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Thanks for everyone's feedback so far. I have polished the KIP after
> > > > offline discussion with some folks working on EOS to make the
> exception
> > > > handling more lightweight. The essential change is that we are not
> > > > inventing a new intermediate exception type, but instead separating a
> > > full
> > > > transaction into two phases:
> > > >
> > > > 1. The data transmission phase
> > > > 2. The commit phase
> > > >
> > > > This way for any exception thrown from phase 1, will be an indicator
> in
> > > > phase 2 whether we should do commit or abort, and from now on
> > > > `commitTransaction` should only throw fatal exceptions, similar to
> > > > `abortTransaction`, so that any KafkaException caught in the commit
> > phase
> > > > will be definitely fatal to crash the app. For more advanced users
> such
> > > as
> > > > Streams, we have the ability to further wrap selected types of fatal
> > > > exceptions to trigger task migration if necessary.
> > > >
> > > > More details in the KIP, feel free to take another look, thanks!
> > > >
> > > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > > > wrote:
> > > >
> > > > > Thanks Bruno for the feedback.
> > > > >
> > > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <br...@confluent.io>
> > > wrote:
> > > > >
> > > > >> Thanks Boyang for the KIP!
> > > > >>
> > > > >> Like Matthias, I do also not know the producer internal well
> enough
> > to
> > > > >> comment on the categorization. However, I think having a super
> > > exception
> > > > >> (e.g. RetriableException) that  encodes if an exception is fatal
> or
> > > not
> > > > >> is cleaner, better understandable and less error-prone, because
> > > ideally
> > > > >> when you add a new non-fatal exception in future you just need to
> > > think
> > > > >> about letting it inherit from the super exception and all the rest
> > of
> > > > >> the code will just behave correctly without the need to wrap the
> new
> > > > >> exception into another exception each time it is thrown (maybe it
> is
> > > > >> thrown at different location in the code).
> > > > >>
> > > > >> As far as I understand the following statement from your previous
> > > e-mail
> > > > >> is the reason that currently such a super exception is not
> possible:
> > > > >>
> > > > >> "Right now we have RetriableException type, if we are going to
> add a
> > > > >> `ProducerRetriableException` type, we have to put this new
> interface
> > > as
> > > > >> the parent of the RetriableException, because not all thrown
> > non-fatal
> > > > >> exceptions are `retriable` in general for producer"
> > > > >>
> > > > >>
> > > > >> In the list of exceptions in your KIP, I found non-fatal
> exceptions
> > > that
> > > > >> do not inherit from RetriableException. I guess those are the ones
> > you
> > > > >> are referring to in your statement:
> > > > >>
> > > > >> InvalidProducerEpochException
> > > > >> InvalidPidMappingException
> > > > >> TransactionAbortedException
> > > > >>
> > > > >> All of those exceptions are non-fatal and do not inherit from
> > > > >> RetriableException. Is there a reason for that? If they depended
> > from
> > > > >> RetriableException we would be a bit closer to a super exception I
> > > > >> mention above.
> > > > >>
> > > > >> The reason is that sender may catch those exceptions in the
> > > > > ProduceResponse, and it currently does infinite
> > > > > retries on RetriableException. To make sure we could trigger the
> > > > > abortTransaction() by catching non-fatal thrown
> > > > > exceptions and reinitialize the txn state, we chose not to let
> those
> > > > > exceptions inherit RetriableException so that
> > > > > they won't cause infinite retry on sender.
> > > > >
> > > > >
> > > > >> With OutOfOrderSequenceException and UnknownProducerIdException, I
> > > think
> > > > >> to understand that their fatality depends on the type (i.e.
> > > > >> configuration) of the producer. That makes it difficult to have a
> > > super
> > > > >> exception that encodes the retriability as mentioned above. Would
> it
> > > be
> > > > >> possible to introduce exceptions that inherit from
> > RetriableException
> > > > >> and that are thrown when those exceptions are caught from the
> > brokers
> > > > >> and the type of the producer is such that the exceptions are
> > > retriable?
> > > > >>
> > > > >> Yea, I think in general the exception type mixing is difficult to
> > get
> > > > > them all right. I have already proposed another solution based on
> my
> > > > > offline discussion with some folks working on EOS
> > > > > to make the handling more straightforward for end users without the
> > > need
> > > > > to distinguish exception fatality.
> > > > >
> > > > >> Best,
> > > > >> Bruno
> > > > >>
> > > > >>
> > > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > > >> > Thanks Boyang for the proposal! I made a pass over the list and
> > here
> > > > are
> > > > >> > some thoughts:
> > > > >> >
> > > > >> > 0) Although this is not part of the public API, I think we
> should
> > > make
> > > > >> sure
> > > > >> > that the suggested pattern (i.e. user can always call abortTxn()
> > > when
> > > > >> > handling non-fatal errors) are indeed supported. E.g. if the txn
> > is
> > > > >> already
> > > > >> > aborted by the broker side, then users can still call abortTxn
> > which
> > > > >> would
> > > > >> > not throw another exception but just be treated as a no-op.
> > > > >> >
> > > > >> > 1) *ConcurrentTransactionsException*: I think this error can
> also
> > be
> > > > >> > returned but not documented yet. This should be a non-fatal
> error.
> > > > >> >
> > > > >> > 2) *InvalidTxnStateException*: this error is returned from
> broker
> > > when
> > > > >> txn
> > > > >> > state transition failed (e.g. it is trying to transit to
> > > > complete-commit
> > > > >> > while the current state is not prepare-commit). This error could
> > > > >> indicates
> > > > >> > a bug on the client internal code or the broker code, OR a user
> > > error
> > > > >> --- a
> > > > >> > similar error is ConcurrentTransactionsException, i.e. if Kafka
> is
> > > > >> bug-free
> > > > >> > these exceptions would only be returned if users try to do
> > something
> > > > >> wrong,
> > > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
> > thinking
> > > it
> > > > >> > should be a non-fatal error instead of a fatal error, wdyt?
> > > > >> >
> > > > >> > 3) *KafkaException*: case i "indicates fatal transactional
> > sequence
> > > > >> > (Fatal)", this is a bit conflicting with the
> > > *OutOfSequenceException*
> > > > >> that
> > > > >> > is treated as non-fatal. I guess your proposal is that
> > > > >> > OutOfOrderSequenceException would be treated either as fatal
> with
> > > > >> > transactional producer, or non-fatal with idempotent producer,
> is
> > > that
> > > > >> > right? If the producer is only configured with idempotency but
> not
> > > > >> > transaction, then throwing a TransactionStateCorruptedException
> > for
> > > > >> > non-fatal errors would be confusing since users are not using
> > > > >> transactions
> > > > >> > at all.. So I suggest we always throw OutOfSequenceException
> as-is
> > > > (i.e.
> > > > >> > not wrapped) no matter how the producer is configured, and let
> the
> > > > >> caller
> > > > >> > decide how to handle it based on whether it is only idempotent
> or
> > > > >> > transactional itself.
> > > > >> >
> > > > >> > 4) Besides all the txn APIs, the `send()` callback / future can
> > also
> > > > >> throw
> > > > >> > txn-related exceptions, I think this KIP should also cover this
> > API
> > > as
> > > > >> well?
> > > > >> >
> > > > >> > 5) This is related to 1/2) above: sometimes those non-fatal
> errors
> > > > like
> > > > >> > ConcurrentTxn or InvalidTxnState are not due to the state being
> > > > >> corrupted
> > > > >> > at the broker side, but maybe users are doing something wrong.
> So
> > > I'm
> > > > >> > wondering if we should further distinguish those non-fatal
> errors
> > > > >> between
> > > > >> > a) those that are caused by Kafka itself, e.g. a broker timed
> out
> > > and
> > > > >> > aborted a txn and later an endTxn request is received, and b)
> the
> > > > user's
> > > > >> > API call pattern is incorrect, causing the request to be
> rejected
> > > with
> > > > >> an
> > > > >> > error code from the broker. *TransactionStateCorruptedException*
> > > feels
> > > > >> to
> > > > >> > me more like for case a), but not case b).
> > > > >> >
> > > > >> >
> > > > >> > Guozhang
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > > reluctanthero104@gmail.com>
> > > > >> > wrote:
> > > > >> >
> > > > >> >> Thanks Matthias, I think your proposal makes sense as well, on
> > the
> > > > pro
> > > > >> side
> > > > >> >> we could have a universally agreed exception type to be caught
> by
> > > the
> > > > >> user,
> > > > >> >> without having an extra layer on top of the actual exceptions.
> I
> > > > could
> > > > >> see
> > > > >> >> some issue on downsides:
> > > > >> >>
> > > > >> >> 1. The exception hierarchy will be more complex. Right now we
> > have
> > > > >> >> RetriableException type, if we are going to add a
> > > > >> >> `ProducerRetriableException` type, we have to put this new
> > > interface
> > > > >> as the
> > > > >> >> parent of the RetriableException, because not all thrown
> > non-fatal
> > > > >> >> exceptions are `retriable` in general for producer, for example
> > > > >> >> <
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > > >> >>> .
> > > > >> >> We could have an empty interface `ProducerRetriableException`
> to
> > > let
> > > > >> all
> > > > >> >> the thrown exceptions implement for sure, but it's a bit
> > unorthodox
> > > > in
> > > > >> the
> > > > >> >> way we deal with exceptions in general.
> > > > >> >>
> > > > >> >> 2. There are cases where we throw a KafkaException wrapping
> > another
> > > > >> >> KafkaException as either fatal or non-fatal. If we use an
> > interface
> > > > to
> > > > >> >> solve #1, it is also required to implement another bloated
> > > exception
> > > > >> class
> > > > >> >> which could replace KafkaException type, as we couldn't mark
> > > > >> KafkaException
> > > > >> >> as retriable for sure.
> > > > >> >>
> > > > >> >> 3. In terms of the encapsulation, wrapping means we could limit
> > the
> > > > >> scope
> > > > >> >> of affection to the producer only, which is important since we
> > > don't
> > > > >> want
> > > > >> >> shared exception types to implement a producer-related
> interface,
> > > > such
> > > > >> >> as UnknownTopicOrPartitionException.
> > > > >> >>
> > > > >> >> Best,
> > > > >> >> Boyang
> > > > >> >>
> > > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <
> mjsax@apache.org
> > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >>> Thanks for the KIP Boyang!
> > > > >> >>>
> > > > >> >>> Overall, categorizing exceptions makes a lot of sense. As I
> > don't
> > > > know
> > > > >> >>> the producer internals well enough, I cannot comment on the
> > > > >> >>> categorization in detail though.
> > > > >> >>>
> > > > >> >>> What I am wondering is, if we should introduce an exception
> > > > interface
> > > > >> >>> that non-fatal exception would implement instead of creating a
> > new
> > > > >> class
> > > > >> >>> that will wrap non-fatal exceptions? What would be the
> pros/cons
> > > for
> > > > >> >>> both designs?
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> -Matthias
> > > > >> >>>
> > > > >> >>>
> > > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > > >> >>>> Hey there,
> > > > >> >>>>
> > > > >> >>>> I would like to start a discussion thread for KIP-691:
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > > >> >>>>
> > > > >> >>>> The KIP is aiming to simplify the exception handling logic
> for
> > > > >> >>>> transactional Producer users by classifying fatal and
> non-fatal
> > > > >> >>> exceptions
> > > > >> >>>> and throw them correspondingly for easier catch and retry.
> Let
> > me
> > > > >> know
> > > > >> >>> what
> > > > >> >>>> you think.
> > > > >> >>>>
> > > > >> >>>> Best,
> > > > >> >>>> Boyang
> > > > >> >>>>
> > > > >> >>>
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks for your clarification on 2)/3), that makes sense.

On Tue, Jan 19, 2021 at 10:16 AM Boyang Chen <re...@gmail.com>
wrote:

> Thanks for the input Guozhang, replied inline.
>
> On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hello Boyang,
> >
> > Thanks for the updated KIP. I read it again and have the following
> > thoughts:
> >
> > 0. I'm a bit concerned that if commitTxn does not throw any non-fatal
> > exception, and instead we rely on the subsequent beginTxn call to throw,
> it
> > may violate some callers with a pattern that relying on commitTxn to
> > succeed to make some non-rollback operations. For example:
> >
> > beginTxn()
> > // do some read-write on my local DB
> > commitTxn()
> > // if commitTxn succeeds, then commit the DB
> >
> > -------------
> >
> > The issue is that, committing DB is a non-rollback operation, and users
> may
> > just rely on commitTxn to return without error to make this non-rollback
> > call. Of course we can just claim this pattern is not legitimate and is
> not
> > the right way of doing things, but many users may naturally adopt this
> > pattern.
> >
> > So maybe we should still let commitTxn also throw non-fatal exceptions,
> in
> > which case we would then call abortTxn again.
> >
> > But if we do this, the pattern becomes:
> >
> > try {
> >    beginTxn()
> >    // do something
> > } catch (Exception) {
> >    shouldCommit = false;
> > }
> >
> > if (shouldCommit) {
> >     try {
> >         commitTxn()
> >     } catch (...) {        // enumerate all fatal exceptions
> >         shutdown()
> >     } catch (KafkaException) {
> >         // non-fatal
> >         shouldCommit = false;
> >     }
> > }
> >
> > if (!shouldCommit && running()) {
> > try {
> >         abortTxn()
> >     } catch (KafkaException) {
> >         // only throw fatal
> >         shutdown()
> >     }
> > }
> >
> > ---------------------
> >
> > Which is much more complicated.
> >
> > Thank makes me think, the alternative we have discussed offline may be
> > better: let commitTxn() to return a boolean flag.
> >
> > * If it returns true, it means the commit succeeded. Users can
> comfortably
> > continue and do any external non-rollback operations if they like.
> > * If it returns false, it means the commit failed with non-fatal error
> *AND
> > the txn has been aborted*. Users do not need to call abortTxn again.
> > * If it throws, then it means fatal errors. Users should shut down the
> > client.
> >
> > In this case, the pattern becomes:
> >
> > try {
> >    beginTxn()
> >    // do something
> > } catch (Exception) {
> >    shouldCommit = false;
> > }
> >
> > try {
> >     if (shouldCommit) {
> >         commitSucceeded = commitTxn()
> >     } else {
> >         // reset offsets, rollback operations, etc
> >         abortTxn()
> >     }
> > } catch (KafkaException) {
> >     // only throw fatal
> >     shutdown()
> > }
> >
> > if (commitSucceeded)
> >    // do other non-rollbackable things
> > else
> >    // reset offsets, rollback operations, etc
> >
> > -------------------------
> >
> > Of course, if we want to have better visibility into what caused the
> commit
> > to fail and txn to abort. We can let the return type be an enum instead
> of
> > a flag. But my main idea is to still let the commitTxn be the final point
> > users can tell whether this txn succeeded or not, instead of relying on
> the
> > next beginTxn() call.
> >
> > I agree that adding a boolean flag is indeed useful in this case. Will
> update the KIP.
>
> 1. Re: "while maintaining the behavior to throw fatal exception in raw" not
> > sure what do you mean by "throw" here. Are you proposing the callback
> would
> > *pass* (not throw) in any fatal exceptions when triggered without
> wrapping?
> >
> > Yes, I want to say *pass*, the benefit is to make the end user's
> expectation consistent
> regarding exception handling.
>
>
> > 2. I'm not sure I understand the claim regarding the callback that "In
> EOS
> > setup, it is not required to handle the exception". Are you proposing
> that,
> > e.g. in Streams, we do not try to handle any exceptions if EOS is enabled
> > in the callback anymore, but just let commitTxn() itself to fail to be
> > notified about the problem? Personally I think in Streams we should just
> > make the handling logic of the callback to be consistent regardless of
> the
> > EOS settings (today we have different logic depending on this logic,
> which
> > I think could be unified as well).
> >
> > My idea originates from the claim on send API:
> "When used as part of a transaction, it is not necessary to define a
> callback or check the result of the future  in order to detect errors from
> <code>send</code>. "
> My understanding is that for EOS, the exception will be detected by calling
> transactional APIs either way, so it is a duplicate handling to track
> all the sendExceptions in RecordCollector. However, I looked up
> sendException is being used today as follow:
>
> 1. Pass to "ProductionExceptionHandler" for any default or customized
> exception handler to handle
> 2. Stop collecting offset info or new exceptions
> 3. Check and rethrow exceptions in close(), flush() or new send() calls
>
> To my understanding, we should still honor the commitment to #1 for any
> user customized implementation. The #2 does not really affect our decision
> upon EOS. The #3 here is still valuable as it could help us fail fast in
> new send() instead of waiting to later stage of processing. In that sense,
> I agree we should continue to record send exceptions even under EOS case to
> ensure the strength of stream side Producer logic. On the safer side, we no
> longer need to wrap certain fatal exceptions like ProducerFenced as
> TaskMigrated, since they should not crash the stream thread if thrown in
> raw format, once we adopt the new processing model in the send phase.
>
>
> >
> > Guozhang
> >
> >
> >
> > On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Thanks for everyone's feedback so far. I have polished the KIP after
> > > offline discussion with some folks working on EOS to make the exception
> > > handling more lightweight. The essential change is that we are not
> > > inventing a new intermediate exception type, but instead separating a
> > full
> > > transaction into two phases:
> > >
> > > 1. The data transmission phase
> > > 2. The commit phase
> > >
> > > This way for any exception thrown from phase 1, will be an indicator in
> > > phase 2 whether we should do commit or abort, and from now on
> > > `commitTransaction` should only throw fatal exceptions, similar to
> > > `abortTransaction`, so that any KafkaException caught in the commit
> phase
> > > will be definitely fatal to crash the app. For more advanced users such
> > as
> > > Streams, we have the ability to further wrap selected types of fatal
> > > exceptions to trigger task migration if necessary.
> > >
> > > More details in the KIP, feel free to take another look, thanks!
> > >
> > > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <
> reluctanthero104@gmail.com>
> > > wrote:
> > >
> > > > Thanks Bruno for the feedback.
> > > >
> > > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <br...@confluent.io>
> > wrote:
> > > >
> > > >> Thanks Boyang for the KIP!
> > > >>
> > > >> Like Matthias, I do also not know the producer internal well enough
> to
> > > >> comment on the categorization. However, I think having a super
> > exception
> > > >> (e.g. RetriableException) that  encodes if an exception is fatal or
> > not
> > > >> is cleaner, better understandable and less error-prone, because
> > ideally
> > > >> when you add a new non-fatal exception in future you just need to
> > think
> > > >> about letting it inherit from the super exception and all the rest
> of
> > > >> the code will just behave correctly without the need to wrap the new
> > > >> exception into another exception each time it is thrown (maybe it is
> > > >> thrown at different location in the code).
> > > >>
> > > >> As far as I understand the following statement from your previous
> > e-mail
> > > >> is the reason that currently such a super exception is not possible:
> > > >>
> > > >> "Right now we have RetriableException type, if we are going to add a
> > > >> `ProducerRetriableException` type, we have to put this new interface
> > as
> > > >> the parent of the RetriableException, because not all thrown
> non-fatal
> > > >> exceptions are `retriable` in general for producer"
> > > >>
> > > >>
> > > >> In the list of exceptions in your KIP, I found non-fatal exceptions
> > that
> > > >> do not inherit from RetriableException. I guess those are the ones
> you
> > > >> are referring to in your statement:
> > > >>
> > > >> InvalidProducerEpochException
> > > >> InvalidPidMappingException
> > > >> TransactionAbortedException
> > > >>
> > > >> All of those exceptions are non-fatal and do not inherit from
> > > >> RetriableException. Is there a reason for that? If they depended
> from
> > > >> RetriableException we would be a bit closer to a super exception I
> > > >> mention above.
> > > >>
> > > >> The reason is that sender may catch those exceptions in the
> > > > ProduceResponse, and it currently does infinite
> > > > retries on RetriableException. To make sure we could trigger the
> > > > abortTransaction() by catching non-fatal thrown
> > > > exceptions and reinitialize the txn state, we chose not to let those
> > > > exceptions inherit RetriableException so that
> > > > they won't cause infinite retry on sender.
> > > >
> > > >
> > > >> With OutOfOrderSequenceException and UnknownProducerIdException, I
> > think
> > > >> to understand that their fatality depends on the type (i.e.
> > > >> configuration) of the producer. That makes it difficult to have a
> > super
> > > >> exception that encodes the retriability as mentioned above. Would it
> > be
> > > >> possible to introduce exceptions that inherit from
> RetriableException
> > > >> and that are thrown when those exceptions are caught from the
> brokers
> > > >> and the type of the producer is such that the exceptions are
> > retriable?
> > > >>
> > > >> Yea, I think in general the exception type mixing is difficult to
> get
> > > > them all right. I have already proposed another solution based on my
> > > > offline discussion with some folks working on EOS
> > > > to make the handling more straightforward for end users without the
> > need
> > > > to distinguish exception fatality.
> > > >
> > > >> Best,
> > > >> Bruno
> > > >>
> > > >>
> > > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > > >> > Thanks Boyang for the proposal! I made a pass over the list and
> here
> > > are
> > > >> > some thoughts:
> > > >> >
> > > >> > 0) Although this is not part of the public API, I think we should
> > make
> > > >> sure
> > > >> > that the suggested pattern (i.e. user can always call abortTxn()
> > when
> > > >> > handling non-fatal errors) are indeed supported. E.g. if the txn
> is
> > > >> already
> > > >> > aborted by the broker side, then users can still call abortTxn
> which
> > > >> would
> > > >> > not throw another exception but just be treated as a no-op.
> > > >> >
> > > >> > 1) *ConcurrentTransactionsException*: I think this error can also
> be
> > > >> > returned but not documented yet. This should be a non-fatal error.
> > > >> >
> > > >> > 2) *InvalidTxnStateException*: this error is returned from broker
> > when
> > > >> txn
> > > >> > state transition failed (e.g. it is trying to transit to
> > > complete-commit
> > > >> > while the current state is not prepare-commit). This error could
> > > >> indicates
> > > >> > a bug on the client internal code or the broker code, OR a user
> > error
> > > >> --- a
> > > >> > similar error is ConcurrentTransactionsException, i.e. if Kafka is
> > > >> bug-free
> > > >> > these exceptions would only be returned if users try to do
> something
> > > >> wrong,
> > > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm
> thinking
> > it
> > > >> > should be a non-fatal error instead of a fatal error, wdyt?
> > > >> >
> > > >> > 3) *KafkaException*: case i "indicates fatal transactional
> sequence
> > > >> > (Fatal)", this is a bit conflicting with the
> > *OutOfSequenceException*
> > > >> that
> > > >> > is treated as non-fatal. I guess your proposal is that
> > > >> > OutOfOrderSequenceException would be treated either as fatal with
> > > >> > transactional producer, or non-fatal with idempotent producer, is
> > that
> > > >> > right? If the producer is only configured with idempotency but not
> > > >> > transaction, then throwing a TransactionStateCorruptedException
> for
> > > >> > non-fatal errors would be confusing since users are not using
> > > >> transactions
> > > >> > at all.. So I suggest we always throw OutOfSequenceException as-is
> > > (i.e.
> > > >> > not wrapped) no matter how the producer is configured, and let the
> > > >> caller
> > > >> > decide how to handle it based on whether it is only idempotent or
> > > >> > transactional itself.
> > > >> >
> > > >> > 4) Besides all the txn APIs, the `send()` callback / future can
> also
> > > >> throw
> > > >> > txn-related exceptions, I think this KIP should also cover this
> API
> > as
> > > >> well?
> > > >> >
> > > >> > 5) This is related to 1/2) above: sometimes those non-fatal errors
> > > like
> > > >> > ConcurrentTxn or InvalidTxnState are not due to the state being
> > > >> corrupted
> > > >> > at the broker side, but maybe users are doing something wrong. So
> > I'm
> > > >> > wondering if we should further distinguish those non-fatal errors
> > > >> between
> > > >> > a) those that are caused by Kafka itself, e.g. a broker timed out
> > and
> > > >> > aborted a txn and later an endTxn request is received, and b) the
> > > user's
> > > >> > API call pattern is incorrect, causing the request to be rejected
> > with
> > > >> an
> > > >> > error code from the broker. *TransactionStateCorruptedException*
> > feels
> > > >> to
> > > >> > me more like for case a), but not case b).
> > > >> >
> > > >> >
> > > >> > Guozhang
> > > >> >
> > > >> >
> > > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > > reluctanthero104@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> >> Thanks Matthias, I think your proposal makes sense as well, on
> the
> > > pro
> > > >> side
> > > >> >> we could have a universally agreed exception type to be caught by
> > the
> > > >> user,
> > > >> >> without having an extra layer on top of the actual exceptions. I
> > > could
> > > >> see
> > > >> >> some issue on downsides:
> > > >> >>
> > > >> >> 1. The exception hierarchy will be more complex. Right now we
> have
> > > >> >> RetriableException type, if we are going to add a
> > > >> >> `ProducerRetriableException` type, we have to put this new
> > interface
> > > >> as the
> > > >> >> parent of the RetriableException, because not all thrown
> non-fatal
> > > >> >> exceptions are `retriable` in general for producer, for example
> > > >> >> <
> > > >> >>
> > > >>
> > >
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > > >> >>> .
> > > >> >> We could have an empty interface `ProducerRetriableException` to
> > let
> > > >> all
> > > >> >> the thrown exceptions implement for sure, but it's a bit
> unorthodox
> > > in
> > > >> the
> > > >> >> way we deal with exceptions in general.
> > > >> >>
> > > >> >> 2. There are cases where we throw a KafkaException wrapping
> another
> > > >> >> KafkaException as either fatal or non-fatal. If we use an
> interface
> > > to
> > > >> >> solve #1, it is also required to implement another bloated
> > exception
> > > >> class
> > > >> >> which could replace KafkaException type, as we couldn't mark
> > > >> KafkaException
> > > >> >> as retriable for sure.
> > > >> >>
> > > >> >> 3. In terms of the encapsulation, wrapping means we could limit
> the
> > > >> scope
> > > >> >> of affection to the producer only, which is important since we
> > don't
> > > >> want
> > > >> >> shared exception types to implement a producer-related interface,
> > > such
> > > >> >> as UnknownTopicOrPartitionException.
> > > >> >>
> > > >> >> Best,
> > > >> >> Boyang
> > > >> >>
> > > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mjsax@apache.org
> >
> > > >> wrote:
> > > >> >>
> > > >> >>> Thanks for the KIP Boyang!
> > > >> >>>
> > > >> >>> Overall, categorizing exceptions makes a lot of sense. As I
> don't
> > > know
> > > >> >>> the producer internals well enough, I cannot comment on the
> > > >> >>> categorization in detail though.
> > > >> >>>
> > > >> >>> What I am wondering is, if we should introduce an exception
> > > interface
> > > >> >>> that non-fatal exception would implement instead of creating a
> new
> > > >> class
> > > >> >>> that will wrap non-fatal exceptions? What would be the pros/cons
> > for
> > > >> >>> both designs?
> > > >> >>>
> > > >> >>>
> > > >> >>> -Matthias
> > > >> >>>
> > > >> >>>
> > > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > >> >>>> Hey there,
> > > >> >>>>
> > > >> >>>> I would like to start a discussion thread for KIP-691:
> > > >> >>>>
> > > >> >>>
> > > >> >>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > > >> >>>>
> > > >> >>>> The KIP is aiming to simplify the exception handling logic for
> > > >> >>>> transactional Producer users by classifying fatal and non-fatal
> > > >> >>> exceptions
> > > >> >>>> and throw them correspondingly for easier catch and retry. Let
> me
> > > >> know
> > > >> >>> what
> > > >> >>>> you think.
> > > >> >>>>
> > > >> >>>> Best,
> > > >> >>>> Boyang
> > > >> >>>>
> > > >> >>>
> > > >> >>
> > > >> >
> > > >> >
> > > >>
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks for the input Guozhang, replied inline.

On Mon, Jan 18, 2021 at 8:57 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hello Boyang,
>
> Thanks for the updated KIP. I read it again and have the following
> thoughts:
>
> 0. I'm a bit concerned that if commitTxn does not throw any non-fatal
> exception, and instead we rely on the subsequent beginTxn call to throw, it
> may violate some callers with a pattern that relying on commitTxn to
> succeed to make some non-rollback operations. For example:
>
> beginTxn()
> // do some read-write on my local DB
> commitTxn()
> // if commitTxn succeeds, then commit the DB
>
> -------------
>
> The issue is that, committing DB is a non-rollback operation, and users may
> just rely on commitTxn to return without error to make this non-rollback
> call. Of course we can just claim this pattern is not legitimate and is not
> the right way of doing things, but many users may naturally adopt this
> pattern.
>
> So maybe we should still let commitTxn also throw non-fatal exceptions, in
> which case we would then call abortTxn again.
>
> But if we do this, the pattern becomes:
>
> try {
>    beginTxn()
>    // do something
> } catch (Exception) {
>    shouldCommit = false;
> }
>
> if (shouldCommit) {
>     try {
>         commitTxn()
>     } catch (...) {        // enumerate all fatal exceptions
>         shutdown()
>     } catch (KafkaException) {
>         // non-fatal
>         shouldCommit = false;
>     }
> }
>
> if (!shouldCommit && running()) {
> try {
>         abortTxn()
>     } catch (KafkaException) {
>         // only throw fatal
>         shutdown()
>     }
> }
>
> ---------------------
>
> Which is much more complicated.
>
> Thank makes me think, the alternative we have discussed offline may be
> better: let commitTxn() to return a boolean flag.
>
> * If it returns true, it means the commit succeeded. Users can comfortably
> continue and do any external non-rollback operations if they like.
> * If it returns false, it means the commit failed with non-fatal error *AND
> the txn has been aborted*. Users do not need to call abortTxn again.
> * If it throws, then it means fatal errors. Users should shut down the
> client.
>
> In this case, the pattern becomes:
>
> try {
>    beginTxn()
>    // do something
> } catch (Exception) {
>    shouldCommit = false;
> }
>
> try {
>     if (shouldCommit) {
>         commitSucceeded = commitTxn()
>     } else {
>         // reset offsets, rollback operations, etc
>         abortTxn()
>     }
> } catch (KafkaException) {
>     // only throw fatal
>     shutdown()
> }
>
> if (commitSucceeded)
>    // do other non-rollbackable things
> else
>    // reset offsets, rollback operations, etc
>
> -------------------------
>
> Of course, if we want to have better visibility into what caused the commit
> to fail and txn to abort. We can let the return type be an enum instead of
> a flag. But my main idea is to still let the commitTxn be the final point
> users can tell whether this txn succeeded or not, instead of relying on the
> next beginTxn() call.
>
> I agree that adding a boolean flag is indeed useful in this case. Will
update the KIP.

1. Re: "while maintaining the behavior to throw fatal exception in raw" not
> sure what do you mean by "throw" here. Are you proposing the callback would
> *pass* (not throw) in any fatal exceptions when triggered without wrapping?
>
> Yes, I want to say *pass*, the benefit is to make the end user's
expectation consistent
regarding exception handling.


> 2. I'm not sure I understand the claim regarding the callback that "In EOS
> setup, it is not required to handle the exception". Are you proposing that,
> e.g. in Streams, we do not try to handle any exceptions if EOS is enabled
> in the callback anymore, but just let commitTxn() itself to fail to be
> notified about the problem? Personally I think in Streams we should just
> make the handling logic of the callback to be consistent regardless of the
> EOS settings (today we have different logic depending on this logic, which
> I think could be unified as well).
>
> My idea originates from the claim on send API:
"When used as part of a transaction, it is not necessary to define a
callback or check the result of the future  in order to detect errors from
<code>send</code>. "
My understanding is that for EOS, the exception will be detected by calling
transactional APIs either way, so it is a duplicate handling to track
all the sendExceptions in RecordCollector. However, I looked up
sendException is being used today as follow:

1. Pass to "ProductionExceptionHandler" for any default or customized
exception handler to handle
2. Stop collecting offset info or new exceptions
3. Check and rethrow exceptions in close(), flush() or new send() calls

To my understanding, we should still honor the commitment to #1 for any
user customized implementation. The #2 does not really affect our decision
upon EOS. The #3 here is still valuable as it could help us fail fast in
new send() instead of waiting to later stage of processing. In that sense,
I agree we should continue to record send exceptions even under EOS case to
ensure the strength of stream side Producer logic. On the safer side, we no
longer need to wrap certain fatal exceptions like ProducerFenced as
TaskMigrated, since they should not crash the stream thread if thrown in
raw format, once we adopt the new processing model in the send phase.


>
> Guozhang
>
>
>
> On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Thanks for everyone's feedback so far. I have polished the KIP after
> > offline discussion with some folks working on EOS to make the exception
> > handling more lightweight. The essential change is that we are not
> > inventing a new intermediate exception type, but instead separating a
> full
> > transaction into two phases:
> >
> > 1. The data transmission phase
> > 2. The commit phase
> >
> > This way for any exception thrown from phase 1, will be an indicator in
> > phase 2 whether we should do commit or abort, and from now on
> > `commitTransaction` should only throw fatal exceptions, similar to
> > `abortTransaction`, so that any KafkaException caught in the commit phase
> > will be definitely fatal to crash the app. For more advanced users such
> as
> > Streams, we have the ability to further wrap selected types of fatal
> > exceptions to trigger task migration if necessary.
> >
> > More details in the KIP, feel free to take another look, thanks!
> >
> > On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> > > Thanks Bruno for the feedback.
> > >
> > > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <br...@confluent.io>
> wrote:
> > >
> > >> Thanks Boyang for the KIP!
> > >>
> > >> Like Matthias, I do also not know the producer internal well enough to
> > >> comment on the categorization. However, I think having a super
> exception
> > >> (e.g. RetriableException) that  encodes if an exception is fatal or
> not
> > >> is cleaner, better understandable and less error-prone, because
> ideally
> > >> when you add a new non-fatal exception in future you just need to
> think
> > >> about letting it inherit from the super exception and all the rest of
> > >> the code will just behave correctly without the need to wrap the new
> > >> exception into another exception each time it is thrown (maybe it is
> > >> thrown at different location in the code).
> > >>
> > >> As far as I understand the following statement from your previous
> e-mail
> > >> is the reason that currently such a super exception is not possible:
> > >>
> > >> "Right now we have RetriableException type, if we are going to add a
> > >> `ProducerRetriableException` type, we have to put this new interface
> as
> > >> the parent of the RetriableException, because not all thrown non-fatal
> > >> exceptions are `retriable` in general for producer"
> > >>
> > >>
> > >> In the list of exceptions in your KIP, I found non-fatal exceptions
> that
> > >> do not inherit from RetriableException. I guess those are the ones you
> > >> are referring to in your statement:
> > >>
> > >> InvalidProducerEpochException
> > >> InvalidPidMappingException
> > >> TransactionAbortedException
> > >>
> > >> All of those exceptions are non-fatal and do not inherit from
> > >> RetriableException. Is there a reason for that? If they depended from
> > >> RetriableException we would be a bit closer to a super exception I
> > >> mention above.
> > >>
> > >> The reason is that sender may catch those exceptions in the
> > > ProduceResponse, and it currently does infinite
> > > retries on RetriableException. To make sure we could trigger the
> > > abortTransaction() by catching non-fatal thrown
> > > exceptions and reinitialize the txn state, we chose not to let those
> > > exceptions inherit RetriableException so that
> > > they won't cause infinite retry on sender.
> > >
> > >
> > >> With OutOfOrderSequenceException and UnknownProducerIdException, I
> think
> > >> to understand that their fatality depends on the type (i.e.
> > >> configuration) of the producer. That makes it difficult to have a
> super
> > >> exception that encodes the retriability as mentioned above. Would it
> be
> > >> possible to introduce exceptions that inherit from RetriableException
> > >> and that are thrown when those exceptions are caught from the brokers
> > >> and the type of the producer is such that the exceptions are
> retriable?
> > >>
> > >> Yea, I think in general the exception type mixing is difficult to get
> > > them all right. I have already proposed another solution based on my
> > > offline discussion with some folks working on EOS
> > > to make the handling more straightforward for end users without the
> need
> > > to distinguish exception fatality.
> > >
> > >> Best,
> > >> Bruno
> > >>
> > >>
> > >> On 04.12.20 19:34, Guozhang Wang wrote:
> > >> > Thanks Boyang for the proposal! I made a pass over the list and here
> > are
> > >> > some thoughts:
> > >> >
> > >> > 0) Although this is not part of the public API, I think we should
> make
> > >> sure
> > >> > that the suggested pattern (i.e. user can always call abortTxn()
> when
> > >> > handling non-fatal errors) are indeed supported. E.g. if the txn is
> > >> already
> > >> > aborted by the broker side, then users can still call abortTxn which
> > >> would
> > >> > not throw another exception but just be treated as a no-op.
> > >> >
> > >> > 1) *ConcurrentTransactionsException*: I think this error can also be
> > >> > returned but not documented yet. This should be a non-fatal error.
> > >> >
> > >> > 2) *InvalidTxnStateException*: this error is returned from broker
> when
> > >> txn
> > >> > state transition failed (e.g. it is trying to transit to
> > complete-commit
> > >> > while the current state is not prepare-commit). This error could
> > >> indicates
> > >> > a bug on the client internal code or the broker code, OR a user
> error
> > >> --- a
> > >> > similar error is ConcurrentTransactionsException, i.e. if Kafka is
> > >> bug-free
> > >> > these exceptions would only be returned if users try to do something
> > >> wrong,
> > >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm thinking
> it
> > >> > should be a non-fatal error instead of a fatal error, wdyt?
> > >> >
> > >> > 3) *KafkaException*: case i "indicates fatal transactional sequence
> > >> > (Fatal)", this is a bit conflicting with the
> *OutOfSequenceException*
> > >> that
> > >> > is treated as non-fatal. I guess your proposal is that
> > >> > OutOfOrderSequenceException would be treated either as fatal with
> > >> > transactional producer, or non-fatal with idempotent producer, is
> that
> > >> > right? If the producer is only configured with idempotency but not
> > >> > transaction, then throwing a TransactionStateCorruptedException for
> > >> > non-fatal errors would be confusing since users are not using
> > >> transactions
> > >> > at all.. So I suggest we always throw OutOfSequenceException as-is
> > (i.e.
> > >> > not wrapped) no matter how the producer is configured, and let the
> > >> caller
> > >> > decide how to handle it based on whether it is only idempotent or
> > >> > transactional itself.
> > >> >
> > >> > 4) Besides all the txn APIs, the `send()` callback / future can also
> > >> throw
> > >> > txn-related exceptions, I think this KIP should also cover this API
> as
> > >> well?
> > >> >
> > >> > 5) This is related to 1/2) above: sometimes those non-fatal errors
> > like
> > >> > ConcurrentTxn or InvalidTxnState are not due to the state being
> > >> corrupted
> > >> > at the broker side, but maybe users are doing something wrong. So
> I'm
> > >> > wondering if we should further distinguish those non-fatal errors
> > >> between
> > >> > a) those that are caused by Kafka itself, e.g. a broker timed out
> and
> > >> > aborted a txn and later an endTxn request is received, and b) the
> > user's
> > >> > API call pattern is incorrect, causing the request to be rejected
> with
> > >> an
> > >> > error code from the broker. *TransactionStateCorruptedException*
> feels
> > >> to
> > >> > me more like for case a), but not case b).
> > >> >
> > >> >
> > >> > Guozhang
> > >> >
> > >> >
> > >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> > reluctanthero104@gmail.com>
> > >> > wrote:
> > >> >
> > >> >> Thanks Matthias, I think your proposal makes sense as well, on the
> > pro
> > >> side
> > >> >> we could have a universally agreed exception type to be caught by
> the
> > >> user,
> > >> >> without having an extra layer on top of the actual exceptions. I
> > could
> > >> see
> > >> >> some issue on downsides:
> > >> >>
> > >> >> 1. The exception hierarchy will be more complex. Right now we have
> > >> >> RetriableException type, if we are going to add a
> > >> >> `ProducerRetriableException` type, we have to put this new
> interface
> > >> as the
> > >> >> parent of the RetriableException, because not all thrown non-fatal
> > >> >> exceptions are `retriable` in general for producer, for example
> > >> >> <
> > >> >>
> > >>
> >
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> > >> >>> .
> > >> >> We could have an empty interface `ProducerRetriableException` to
> let
> > >> all
> > >> >> the thrown exceptions implement for sure, but it's a bit unorthodox
> > in
> > >> the
> > >> >> way we deal with exceptions in general.
> > >> >>
> > >> >> 2. There are cases where we throw a KafkaException wrapping another
> > >> >> KafkaException as either fatal or non-fatal. If we use an interface
> > to
> > >> >> solve #1, it is also required to implement another bloated
> exception
> > >> class
> > >> >> which could replace KafkaException type, as we couldn't mark
> > >> KafkaException
> > >> >> as retriable for sure.
> > >> >>
> > >> >> 3. In terms of the encapsulation, wrapping means we could limit the
> > >> scope
> > >> >> of affection to the producer only, which is important since we
> don't
> > >> want
> > >> >> shared exception types to implement a producer-related interface,
> > such
> > >> >> as UnknownTopicOrPartitionException.
> > >> >>
> > >> >> Best,
> > >> >> Boyang
> > >> >>
> > >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org>
> > >> wrote:
> > >> >>
> > >> >>> Thanks for the KIP Boyang!
> > >> >>>
> > >> >>> Overall, categorizing exceptions makes a lot of sense. As I don't
> > know
> > >> >>> the producer internals well enough, I cannot comment on the
> > >> >>> categorization in detail though.
> > >> >>>
> > >> >>> What I am wondering is, if we should introduce an exception
> > interface
> > >> >>> that non-fatal exception would implement instead of creating a new
> > >> class
> > >> >>> that will wrap non-fatal exceptions? What would be the pros/cons
> for
> > >> >>> both designs?
> > >> >>>
> > >> >>>
> > >> >>> -Matthias
> > >> >>>
> > >> >>>
> > >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > >> >>>> Hey there,
> > >> >>>>
> > >> >>>> I would like to start a discussion thread for KIP-691:
> > >> >>>>
> > >> >>>
> > >> >>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > >> >>>>
> > >> >>>> The KIP is aiming to simplify the exception handling logic for
> > >> >>>> transactional Producer users by classifying fatal and non-fatal
> > >> >>> exceptions
> > >> >>>> and throw them correspondingly for easier catch and retry. Let me
> > >> know
> > >> >>> what
> > >> >>>> you think.
> > >> >>>>
> > >> >>>> Best,
> > >> >>>> Boyang
> > >> >>>>
> > >> >>>
> > >> >>
> > >> >
> > >> >
> > >>
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Boyang,

Thanks for the updated KIP. I read it again and have the following thoughts:

0. I'm a bit concerned that if commitTxn does not throw any non-fatal
exception, and instead we rely on the subsequent beginTxn call to throw, it
may violate some callers with a pattern that relying on commitTxn to
succeed to make some non-rollback operations. For example:

beginTxn()
// do some read-write on my local DB
commitTxn()
// if commitTxn succeeds, then commit the DB

-------------

The issue is that, committing DB is a non-rollback operation, and users may
just rely on commitTxn to return without error to make this non-rollback
call. Of course we can just claim this pattern is not legitimate and is not
the right way of doing things, but many users may naturally adopt this
pattern.

So maybe we should still let commitTxn also throw non-fatal exceptions, in
which case we would then call abortTxn again.

But if we do this, the pattern becomes:

try {
   beginTxn()
   // do something
} catch (Exception) {
   shouldCommit = false;
}

if (shouldCommit) {
    try {
        commitTxn()
    } catch (...) {        // enumerate all fatal exceptions
        shutdown()
    } catch (KafkaException) {
        // non-fatal
        shouldCommit = false;
    }
}

if (!shouldCommit && running()) {
try {
        abortTxn()
    } catch (KafkaException) {
        // only throw fatal
        shutdown()
    }
}

---------------------

Which is much more complicated.

Thank makes me think, the alternative we have discussed offline may be
better: let commitTxn() to return a boolean flag.

* If it returns true, it means the commit succeeded. Users can comfortably
continue and do any external non-rollback operations if they like.
* If it returns false, it means the commit failed with non-fatal error *AND
the txn has been aborted*. Users do not need to call abortTxn again.
* If it throws, then it means fatal errors. Users should shut down the
client.

In this case, the pattern becomes:

try {
   beginTxn()
   // do something
} catch (Exception) {
   shouldCommit = false;
}

try {
    if (shouldCommit) {
        commitSucceeded = commitTxn()
    } else {
        // reset offsets, rollback operations, etc
        abortTxn()
    }
} catch (KafkaException) {
    // only throw fatal
    shutdown()
}

if (commitSucceeded)
   // do other non-rollbackable things
else
   // reset offsets, rollback operations, etc

-------------------------

Of course, if we want to have better visibility into what caused the commit
to fail and txn to abort. We can let the return type be an enum instead of
a flag. But my main idea is to still let the commitTxn be the final point
users can tell whether this txn succeeded or not, instead of relying on the
next beginTxn() call.

1. Re: "while maintaining the behavior to throw fatal exception in raw" not
sure what do you mean by "throw" here. Are you proposing the callback would
*pass* (not throw) in any fatal exceptions when triggered without wrapping?

2. I'm not sure I understand the claim regarding the callback that "In EOS
setup, it is not required to handle the exception". Are you proposing that,
e.g. in Streams, we do not try to handle any exceptions if EOS is enabled
in the callback anymore, but just let commitTxn() itself to fail to be
notified about the problem? Personally I think in Streams we should just
make the handling logic of the callback to be consistent regardless of the
EOS settings (today we have different logic depending on this logic, which
I think could be unified as well).



Guozhang



On Thu, Dec 17, 2020 at 8:42 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks for everyone's feedback so far. I have polished the KIP after
> offline discussion with some folks working on EOS to make the exception
> handling more lightweight. The essential change is that we are not
> inventing a new intermediate exception type, but instead separating a full
> transaction into two phases:
>
> 1. The data transmission phase
> 2. The commit phase
>
> This way for any exception thrown from phase 1, will be an indicator in
> phase 2 whether we should do commit or abort, and from now on
> `commitTransaction` should only throw fatal exceptions, similar to
> `abortTransaction`, so that any KafkaException caught in the commit phase
> will be definitely fatal to crash the app. For more advanced users such as
> Streams, we have the ability to further wrap selected types of fatal
> exceptions to trigger task migration if necessary.
>
> More details in the KIP, feel free to take another look, thanks!
>
> On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <re...@gmail.com>
> wrote:
>
> > Thanks Bruno for the feedback.
> >
> > On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <br...@confluent.io> wrote:
> >
> >> Thanks Boyang for the KIP!
> >>
> >> Like Matthias, I do also not know the producer internal well enough to
> >> comment on the categorization. However, I think having a super exception
> >> (e.g. RetriableException) that  encodes if an exception is fatal or not
> >> is cleaner, better understandable and less error-prone, because ideally
> >> when you add a new non-fatal exception in future you just need to think
> >> about letting it inherit from the super exception and all the rest of
> >> the code will just behave correctly without the need to wrap the new
> >> exception into another exception each time it is thrown (maybe it is
> >> thrown at different location in the code).
> >>
> >> As far as I understand the following statement from your previous e-mail
> >> is the reason that currently such a super exception is not possible:
> >>
> >> "Right now we have RetriableException type, if we are going to add a
> >> `ProducerRetriableException` type, we have to put this new interface as
> >> the parent of the RetriableException, because not all thrown non-fatal
> >> exceptions are `retriable` in general for producer"
> >>
> >>
> >> In the list of exceptions in your KIP, I found non-fatal exceptions that
> >> do not inherit from RetriableException. I guess those are the ones you
> >> are referring to in your statement:
> >>
> >> InvalidProducerEpochException
> >> InvalidPidMappingException
> >> TransactionAbortedException
> >>
> >> All of those exceptions are non-fatal and do not inherit from
> >> RetriableException. Is there a reason for that? If they depended from
> >> RetriableException we would be a bit closer to a super exception I
> >> mention above.
> >>
> >> The reason is that sender may catch those exceptions in the
> > ProduceResponse, and it currently does infinite
> > retries on RetriableException. To make sure we could trigger the
> > abortTransaction() by catching non-fatal thrown
> > exceptions and reinitialize the txn state, we chose not to let those
> > exceptions inherit RetriableException so that
> > they won't cause infinite retry on sender.
> >
> >
> >> With OutOfOrderSequenceException and UnknownProducerIdException, I think
> >> to understand that their fatality depends on the type (i.e.
> >> configuration) of the producer. That makes it difficult to have a super
> >> exception that encodes the retriability as mentioned above. Would it be
> >> possible to introduce exceptions that inherit from RetriableException
> >> and that are thrown when those exceptions are caught from the brokers
> >> and the type of the producer is such that the exceptions are retriable?
> >>
> >> Yea, I think in general the exception type mixing is difficult to get
> > them all right. I have already proposed another solution based on my
> > offline discussion with some folks working on EOS
> > to make the handling more straightforward for end users without the need
> > to distinguish exception fatality.
> >
> >> Best,
> >> Bruno
> >>
> >>
> >> On 04.12.20 19:34, Guozhang Wang wrote:
> >> > Thanks Boyang for the proposal! I made a pass over the list and here
> are
> >> > some thoughts:
> >> >
> >> > 0) Although this is not part of the public API, I think we should make
> >> sure
> >> > that the suggested pattern (i.e. user can always call abortTxn() when
> >> > handling non-fatal errors) are indeed supported. E.g. if the txn is
> >> already
> >> > aborted by the broker side, then users can still call abortTxn which
> >> would
> >> > not throw another exception but just be treated as a no-op.
> >> >
> >> > 1) *ConcurrentTransactionsException*: I think this error can also be
> >> > returned but not documented yet. This should be a non-fatal error.
> >> >
> >> > 2) *InvalidTxnStateException*: this error is returned from broker when
> >> txn
> >> > state transition failed (e.g. it is trying to transit to
> complete-commit
> >> > while the current state is not prepare-commit). This error could
> >> indicates
> >> > a bug on the client internal code or the broker code, OR a user error
> >> --- a
> >> > similar error is ConcurrentTransactionsException, i.e. if Kafka is
> >> bug-free
> >> > these exceptions would only be returned if users try to do something
> >> wrong,
> >> > e.g. calling abortTxn right after a commitTxn, etc. So I'm thinking it
> >> > should be a non-fatal error instead of a fatal error, wdyt?
> >> >
> >> > 3) *KafkaException*: case i "indicates fatal transactional sequence
> >> > (Fatal)", this is a bit conflicting with the *OutOfSequenceException*
> >> that
> >> > is treated as non-fatal. I guess your proposal is that
> >> > OutOfOrderSequenceException would be treated either as fatal with
> >> > transactional producer, or non-fatal with idempotent producer, is that
> >> > right? If the producer is only configured with idempotency but not
> >> > transaction, then throwing a TransactionStateCorruptedException for
> >> > non-fatal errors would be confusing since users are not using
> >> transactions
> >> > at all.. So I suggest we always throw OutOfSequenceException as-is
> (i.e.
> >> > not wrapped) no matter how the producer is configured, and let the
> >> caller
> >> > decide how to handle it based on whether it is only idempotent or
> >> > transactional itself.
> >> >
> >> > 4) Besides all the txn APIs, the `send()` callback / future can also
> >> throw
> >> > txn-related exceptions, I think this KIP should also cover this API as
> >> well?
> >> >
> >> > 5) This is related to 1/2) above: sometimes those non-fatal errors
> like
> >> > ConcurrentTxn or InvalidTxnState are not due to the state being
> >> corrupted
> >> > at the broker side, but maybe users are doing something wrong. So I'm
> >> > wondering if we should further distinguish those non-fatal errors
> >> between
> >> > a) those that are caused by Kafka itself, e.g. a broker timed out and
> >> > aborted a txn and later an endTxn request is received, and b) the
> user's
> >> > API call pattern is incorrect, causing the request to be rejected with
> >> an
> >> > error code from the broker. *TransactionStateCorruptedException* feels
> >> to
> >> > me more like for case a), but not case b).
> >> >
> >> >
> >> > Guozhang
> >> >
> >> >
> >> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <
> reluctanthero104@gmail.com>
> >> > wrote:
> >> >
> >> >> Thanks Matthias, I think your proposal makes sense as well, on the
> pro
> >> side
> >> >> we could have a universally agreed exception type to be caught by the
> >> user,
> >> >> without having an extra layer on top of the actual exceptions. I
> could
> >> see
> >> >> some issue on downsides:
> >> >>
> >> >> 1. The exception hierarchy will be more complex. Right now we have
> >> >> RetriableException type, if we are going to add a
> >> >> `ProducerRetriableException` type, we have to put this new interface
> >> as the
> >> >> parent of the RetriableException, because not all thrown non-fatal
> >> >> exceptions are `retriable` in general for producer, for example
> >> >> <
> >> >>
> >>
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> >> >>> .
> >> >> We could have an empty interface `ProducerRetriableException` to let
> >> all
> >> >> the thrown exceptions implement for sure, but it's a bit unorthodox
> in
> >> the
> >> >> way we deal with exceptions in general.
> >> >>
> >> >> 2. There are cases where we throw a KafkaException wrapping another
> >> >> KafkaException as either fatal or non-fatal. If we use an interface
> to
> >> >> solve #1, it is also required to implement another bloated exception
> >> class
> >> >> which could replace KafkaException type, as we couldn't mark
> >> KafkaException
> >> >> as retriable for sure.
> >> >>
> >> >> 3. In terms of the encapsulation, wrapping means we could limit the
> >> scope
> >> >> of affection to the producer only, which is important since we don't
> >> want
> >> >> shared exception types to implement a producer-related interface,
> such
> >> >> as UnknownTopicOrPartitionException.
> >> >>
> >> >> Best,
> >> >> Boyang
> >> >>
> >> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org>
> >> wrote:
> >> >>
> >> >>> Thanks for the KIP Boyang!
> >> >>>
> >> >>> Overall, categorizing exceptions makes a lot of sense. As I don't
> know
> >> >>> the producer internals well enough, I cannot comment on the
> >> >>> categorization in detail though.
> >> >>>
> >> >>> What I am wondering is, if we should introduce an exception
> interface
> >> >>> that non-fatal exception would implement instead of creating a new
> >> class
> >> >>> that will wrap non-fatal exceptions? What would be the pros/cons for
> >> >>> both designs?
> >> >>>
> >> >>>
> >> >>> -Matthias
> >> >>>
> >> >>>
> >> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> >> >>>> Hey there,
> >> >>>>
> >> >>>> I would like to start a discussion thread for KIP-691:
> >> >>>>
> >> >>>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> >> >>>>
> >> >>>> The KIP is aiming to simplify the exception handling logic for
> >> >>>> transactional Producer users by classifying fatal and non-fatal
> >> >>> exceptions
> >> >>>> and throw them correspondingly for easier catch and retry. Let me
> >> know
> >> >>> what
> >> >>>> you think.
> >> >>>>
> >> >>>> Best,
> >> >>>> Boyang
> >> >>>>
> >> >>>
> >> >>
> >> >
> >> >
> >>
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks for everyone's feedback so far. I have polished the KIP after
offline discussion with some folks working on EOS to make the exception
handling more lightweight. The essential change is that we are not
inventing a new intermediate exception type, but instead separating a full
transaction into two phases:

1. The data transmission phase
2. The commit phase

This way for any exception thrown from phase 1, will be an indicator in
phase 2 whether we should do commit or abort, and from now on
`commitTransaction` should only throw fatal exceptions, similar to
`abortTransaction`, so that any KafkaException caught in the commit phase
will be definitely fatal to crash the app. For more advanced users such as
Streams, we have the ability to further wrap selected types of fatal
exceptions to trigger task migration if necessary.

More details in the KIP, feel free to take another look, thanks!

On Thu, Dec 17, 2020 at 8:36 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks Bruno for the feedback.
>
> On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <br...@confluent.io> wrote:
>
>> Thanks Boyang for the KIP!
>>
>> Like Matthias, I do also not know the producer internal well enough to
>> comment on the categorization. However, I think having a super exception
>> (e.g. RetriableException) that  encodes if an exception is fatal or not
>> is cleaner, better understandable and less error-prone, because ideally
>> when you add a new non-fatal exception in future you just need to think
>> about letting it inherit from the super exception and all the rest of
>> the code will just behave correctly without the need to wrap the new
>> exception into another exception each time it is thrown (maybe it is
>> thrown at different location in the code).
>>
>> As far as I understand the following statement from your previous e-mail
>> is the reason that currently such a super exception is not possible:
>>
>> "Right now we have RetriableException type, if we are going to add a
>> `ProducerRetriableException` type, we have to put this new interface as
>> the parent of the RetriableException, because not all thrown non-fatal
>> exceptions are `retriable` in general for producer"
>>
>>
>> In the list of exceptions in your KIP, I found non-fatal exceptions that
>> do not inherit from RetriableException. I guess those are the ones you
>> are referring to in your statement:
>>
>> InvalidProducerEpochException
>> InvalidPidMappingException
>> TransactionAbortedException
>>
>> All of those exceptions are non-fatal and do not inherit from
>> RetriableException. Is there a reason for that? If they depended from
>> RetriableException we would be a bit closer to a super exception I
>> mention above.
>>
>> The reason is that sender may catch those exceptions in the
> ProduceResponse, and it currently does infinite
> retries on RetriableException. To make sure we could trigger the
> abortTransaction() by catching non-fatal thrown
> exceptions and reinitialize the txn state, we chose not to let those
> exceptions inherit RetriableException so that
> they won't cause infinite retry on sender.
>
>
>> With OutOfOrderSequenceException and UnknownProducerIdException, I think
>> to understand that their fatality depends on the type (i.e.
>> configuration) of the producer. That makes it difficult to have a super
>> exception that encodes the retriability as mentioned above. Would it be
>> possible to introduce exceptions that inherit from RetriableException
>> and that are thrown when those exceptions are caught from the brokers
>> and the type of the producer is such that the exceptions are retriable?
>>
>> Yea, I think in general the exception type mixing is difficult to get
> them all right. I have already proposed another solution based on my
> offline discussion with some folks working on EOS
> to make the handling more straightforward for end users without the need
> to distinguish exception fatality.
>
>> Best,
>> Bruno
>>
>>
>> On 04.12.20 19:34, Guozhang Wang wrote:
>> > Thanks Boyang for the proposal! I made a pass over the list and here are
>> > some thoughts:
>> >
>> > 0) Although this is not part of the public API, I think we should make
>> sure
>> > that the suggested pattern (i.e. user can always call abortTxn() when
>> > handling non-fatal errors) are indeed supported. E.g. if the txn is
>> already
>> > aborted by the broker side, then users can still call abortTxn which
>> would
>> > not throw another exception but just be treated as a no-op.
>> >
>> > 1) *ConcurrentTransactionsException*: I think this error can also be
>> > returned but not documented yet. This should be a non-fatal error.
>> >
>> > 2) *InvalidTxnStateException*: this error is returned from broker when
>> txn
>> > state transition failed (e.g. it is trying to transit to complete-commit
>> > while the current state is not prepare-commit). This error could
>> indicates
>> > a bug on the client internal code or the broker code, OR a user error
>> --- a
>> > similar error is ConcurrentTransactionsException, i.e. if Kafka is
>> bug-free
>> > these exceptions would only be returned if users try to do something
>> wrong,
>> > e.g. calling abortTxn right after a commitTxn, etc. So I'm thinking it
>> > should be a non-fatal error instead of a fatal error, wdyt?
>> >
>> > 3) *KafkaException*: case i "indicates fatal transactional sequence
>> > (Fatal)", this is a bit conflicting with the *OutOfSequenceException*
>> that
>> > is treated as non-fatal. I guess your proposal is that
>> > OutOfOrderSequenceException would be treated either as fatal with
>> > transactional producer, or non-fatal with idempotent producer, is that
>> > right? If the producer is only configured with idempotency but not
>> > transaction, then throwing a TransactionStateCorruptedException for
>> > non-fatal errors would be confusing since users are not using
>> transactions
>> > at all.. So I suggest we always throw OutOfSequenceException as-is (i.e.
>> > not wrapped) no matter how the producer is configured, and let the
>> caller
>> > decide how to handle it based on whether it is only idempotent or
>> > transactional itself.
>> >
>> > 4) Besides all the txn APIs, the `send()` callback / future can also
>> throw
>> > txn-related exceptions, I think this KIP should also cover this API as
>> well?
>> >
>> > 5) This is related to 1/2) above: sometimes those non-fatal errors like
>> > ConcurrentTxn or InvalidTxnState are not due to the state being
>> corrupted
>> > at the broker side, but maybe users are doing something wrong. So I'm
>> > wondering if we should further distinguish those non-fatal errors
>> between
>> > a) those that are caused by Kafka itself, e.g. a broker timed out and
>> > aborted a txn and later an endTxn request is received, and b) the user's
>> > API call pattern is incorrect, causing the request to be rejected with
>> an
>> > error code from the broker. *TransactionStateCorruptedException* feels
>> to
>> > me more like for case a), but not case b).
>> >
>> >
>> > Guozhang
>> >
>> >
>> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <re...@gmail.com>
>> > wrote:
>> >
>> >> Thanks Matthias, I think your proposal makes sense as well, on the pro
>> side
>> >> we could have a universally agreed exception type to be caught by the
>> user,
>> >> without having an extra layer on top of the actual exceptions. I could
>> see
>> >> some issue on downsides:
>> >>
>> >> 1. The exception hierarchy will be more complex. Right now we have
>> >> RetriableException type, if we are going to add a
>> >> `ProducerRetriableException` type, we have to put this new interface
>> as the
>> >> parent of the RetriableException, because not all thrown non-fatal
>> >> exceptions are `retriable` in general for producer, for example
>> >> <
>> >>
>> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
>> >>> .
>> >> We could have an empty interface `ProducerRetriableException` to let
>> all
>> >> the thrown exceptions implement for sure, but it's a bit unorthodox in
>> the
>> >> way we deal with exceptions in general.
>> >>
>> >> 2. There are cases where we throw a KafkaException wrapping another
>> >> KafkaException as either fatal or non-fatal. If we use an interface to
>> >> solve #1, it is also required to implement another bloated exception
>> class
>> >> which could replace KafkaException type, as we couldn't mark
>> KafkaException
>> >> as retriable for sure.
>> >>
>> >> 3. In terms of the encapsulation, wrapping means we could limit the
>> scope
>> >> of affection to the producer only, which is important since we don't
>> want
>> >> shared exception types to implement a producer-related interface, such
>> >> as UnknownTopicOrPartitionException.
>> >>
>> >> Best,
>> >> Boyang
>> >>
>> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org>
>> wrote:
>> >>
>> >>> Thanks for the KIP Boyang!
>> >>>
>> >>> Overall, categorizing exceptions makes a lot of sense. As I don't know
>> >>> the producer internals well enough, I cannot comment on the
>> >>> categorization in detail though.
>> >>>
>> >>> What I am wondering is, if we should introduce an exception interface
>> >>> that non-fatal exception would implement instead of creating a new
>> class
>> >>> that will wrap non-fatal exceptions? What would be the pros/cons for
>> >>> both designs?
>> >>>
>> >>>
>> >>> -Matthias
>> >>>
>> >>>
>> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
>> >>>> Hey there,
>> >>>>
>> >>>> I would like to start a discussion thread for KIP-691:
>> >>>>
>> >>>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
>> >>>>
>> >>>> The KIP is aiming to simplify the exception handling logic for
>> >>>> transactional Producer users by classifying fatal and non-fatal
>> >>> exceptions
>> >>>> and throw them correspondingly for easier catch and retry. Let me
>> know
>> >>> what
>> >>>> you think.
>> >>>>
>> >>>> Best,
>> >>>> Boyang
>> >>>>
>> >>>
>> >>
>> >
>> >
>>
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks Bruno for the feedback.

On Mon, Dec 7, 2020 at 5:26 AM Bruno Cadonna <br...@confluent.io> wrote:

> Thanks Boyang for the KIP!
>
> Like Matthias, I do also not know the producer internal well enough to
> comment on the categorization. However, I think having a super exception
> (e.g. RetriableException) that  encodes if an exception is fatal or not
> is cleaner, better understandable and less error-prone, because ideally
> when you add a new non-fatal exception in future you just need to think
> about letting it inherit from the super exception and all the rest of
> the code will just behave correctly without the need to wrap the new
> exception into another exception each time it is thrown (maybe it is
> thrown at different location in the code).
>
> As far as I understand the following statement from your previous e-mail
> is the reason that currently such a super exception is not possible:
>
> "Right now we have RetriableException type, if we are going to add a
> `ProducerRetriableException` type, we have to put this new interface as
> the parent of the RetriableException, because not all thrown non-fatal
> exceptions are `retriable` in general for producer"
>
>
> In the list of exceptions in your KIP, I found non-fatal exceptions that
> do not inherit from RetriableException. I guess those are the ones you
> are referring to in your statement:
>
> InvalidProducerEpochException
> InvalidPidMappingException
> TransactionAbortedException
>
> All of those exceptions are non-fatal and do not inherit from
> RetriableException. Is there a reason for that? If they depended from
> RetriableException we would be a bit closer to a super exception I
> mention above.
>
> The reason is that sender may catch those exceptions in the
ProduceResponse, and it currently does infinite
retries on RetriableException. To make sure we could trigger the
abortTransaction() by catching non-fatal thrown
exceptions and reinitialize the txn state, we chose not to let those
exceptions inherit RetriableException so that
they won't cause infinite retry on sender.


> With OutOfOrderSequenceException and UnknownProducerIdException, I think
> to understand that their fatality depends on the type (i.e.
> configuration) of the producer. That makes it difficult to have a super
> exception that encodes the retriability as mentioned above. Would it be
> possible to introduce exceptions that inherit from RetriableException
> and that are thrown when those exceptions are caught from the brokers
> and the type of the producer is such that the exceptions are retriable?
>
> Yea, I think in general the exception type mixing is difficult to get them
all right. I have already proposed another solution based on my offline
discussion with some folks working on EOS
to make the handling more straightforward for end users without the need to
distinguish exception fatality.

> Best,
> Bruno
>
>
> On 04.12.20 19:34, Guozhang Wang wrote:
> > Thanks Boyang for the proposal! I made a pass over the list and here are
> > some thoughts:
> >
> > 0) Although this is not part of the public API, I think we should make
> sure
> > that the suggested pattern (i.e. user can always call abortTxn() when
> > handling non-fatal errors) are indeed supported. E.g. if the txn is
> already
> > aborted by the broker side, then users can still call abortTxn which
> would
> > not throw another exception but just be treated as a no-op.
> >
> > 1) *ConcurrentTransactionsException*: I think this error can also be
> > returned but not documented yet. This should be a non-fatal error.
> >
> > 2) *InvalidTxnStateException*: this error is returned from broker when
> txn
> > state transition failed (e.g. it is trying to transit to complete-commit
> > while the current state is not prepare-commit). This error could
> indicates
> > a bug on the client internal code or the broker code, OR a user error
> --- a
> > similar error is ConcurrentTransactionsException, i.e. if Kafka is
> bug-free
> > these exceptions would only be returned if users try to do something
> wrong,
> > e.g. calling abortTxn right after a commitTxn, etc. So I'm thinking it
> > should be a non-fatal error instead of a fatal error, wdyt?
> >
> > 3) *KafkaException*: case i "indicates fatal transactional sequence
> > (Fatal)", this is a bit conflicting with the *OutOfSequenceException*
> that
> > is treated as non-fatal. I guess your proposal is that
> > OutOfOrderSequenceException would be treated either as fatal with
> > transactional producer, or non-fatal with idempotent producer, is that
> > right? If the producer is only configured with idempotency but not
> > transaction, then throwing a TransactionStateCorruptedException for
> > non-fatal errors would be confusing since users are not using
> transactions
> > at all.. So I suggest we always throw OutOfSequenceException as-is (i.e.
> > not wrapped) no matter how the producer is configured, and let the caller
> > decide how to handle it based on whether it is only idempotent or
> > transactional itself.
> >
> > 4) Besides all the txn APIs, the `send()` callback / future can also
> throw
> > txn-related exceptions, I think this KIP should also cover this API as
> well?
> >
> > 5) This is related to 1/2) above: sometimes those non-fatal errors like
> > ConcurrentTxn or InvalidTxnState are not due to the state being corrupted
> > at the broker side, but maybe users are doing something wrong. So I'm
> > wondering if we should further distinguish those non-fatal errors between
> > a) those that are caused by Kafka itself, e.g. a broker timed out and
> > aborted a txn and later an endTxn request is received, and b) the user's
> > API call pattern is incorrect, causing the request to be rejected with an
> > error code from the broker. *TransactionStateCorruptedException* feels to
> > me more like for case a), but not case b).
> >
> >
> > Guozhang
> >
> >
> > On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <re...@gmail.com>
> > wrote:
> >
> >> Thanks Matthias, I think your proposal makes sense as well, on the pro
> side
> >> we could have a universally agreed exception type to be caught by the
> user,
> >> without having an extra layer on top of the actual exceptions. I could
> see
> >> some issue on downsides:
> >>
> >> 1. The exception hierarchy will be more complex. Right now we have
> >> RetriableException type, if we are going to add a
> >> `ProducerRetriableException` type, we have to put this new interface as
> the
> >> parent of the RetriableException, because not all thrown non-fatal
> >> exceptions are `retriable` in general for producer, for example
> >> <
> >>
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> >>> .
> >> We could have an empty interface `ProducerRetriableException` to let all
> >> the thrown exceptions implement for sure, but it's a bit unorthodox in
> the
> >> way we deal with exceptions in general.
> >>
> >> 2. There are cases where we throw a KafkaException wrapping another
> >> KafkaException as either fatal or non-fatal. If we use an interface to
> >> solve #1, it is also required to implement another bloated exception
> class
> >> which could replace KafkaException type, as we couldn't mark
> KafkaException
> >> as retriable for sure.
> >>
> >> 3. In terms of the encapsulation, wrapping means we could limit the
> scope
> >> of affection to the producer only, which is important since we don't
> want
> >> shared exception types to implement a producer-related interface, such
> >> as UnknownTopicOrPartitionException.
> >>
> >> Best,
> >> Boyang
> >>
> >> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org>
> wrote:
> >>
> >>> Thanks for the KIP Boyang!
> >>>
> >>> Overall, categorizing exceptions makes a lot of sense. As I don't know
> >>> the producer internals well enough, I cannot comment on the
> >>> categorization in detail though.
> >>>
> >>> What I am wondering is, if we should introduce an exception interface
> >>> that non-fatal exception would implement instead of creating a new
> class
> >>> that will wrap non-fatal exceptions? What would be the pros/cons for
> >>> both designs?
> >>>
> >>>
> >>> -Matthias
> >>>
> >>>
> >>> On 12/2/20 11:35 AM, Boyang Chen wrote:
> >>>> Hey there,
> >>>>
> >>>> I would like to start a discussion thread for KIP-691:
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> >>>>
> >>>> The KIP is aiming to simplify the exception handling logic for
> >>>> transactional Producer users by classifying fatal and non-fatal
> >>> exceptions
> >>>> and throw them correspondingly for easier catch and retry. Let me know
> >>> what
> >>>> you think.
> >>>>
> >>>> Best,
> >>>> Boyang
> >>>>
> >>>
> >>
> >
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Bruno Cadonna <br...@confluent.io>.
Thanks Boyang for the KIP!

Like Matthias, I do also not know the producer internal well enough to 
comment on the categorization. However, I think having a super exception 
(e.g. RetriableException) that  encodes if an exception is fatal or not 
is cleaner, better understandable and less error-prone, because ideally 
when you add a new non-fatal exception in future you just need to think 
about letting it inherit from the super exception and all the rest of 
the code will just behave correctly without the need to wrap the new 
exception into another exception each time it is thrown (maybe it is 
thrown at different location in the code).

As far as I understand the following statement from your previous e-mail 
is the reason that currently such a super exception is not possible:

"Right now we have RetriableException type, if we are going to add a 
`ProducerRetriableException` type, we have to put this new interface as 
the parent of the RetriableException, because not all thrown non-fatal 
exceptions are `retriable` in general for producer"


In the list of exceptions in your KIP, I found non-fatal exceptions that 
do not inherit from RetriableException. I guess those are the ones you 
are referring to in your statement:

InvalidProducerEpochException
InvalidPidMappingException
TransactionAbortedException

All of those exceptions are non-fatal and do not inherit from 
RetriableException. Is there a reason for that? If they depended from 
RetriableException we would be a bit closer to a super exception I 
mention above.

With OutOfOrderSequenceException and UnknownProducerIdException, I think 
to understand that their fatality depends on the type (i.e. 
configuration) of the producer. That makes it difficult to have a super 
exception that encodes the retriability as mentioned above. Would it be 
possible to introduce exceptions that inherit from RetriableException 
and that are thrown when those exceptions are caught from the brokers 
and the type of the producer is such that the exceptions are retriable?

Best,
Bruno


On 04.12.20 19:34, Guozhang Wang wrote:
> Thanks Boyang for the proposal! I made a pass over the list and here are
> some thoughts:
> 
> 0) Although this is not part of the public API, I think we should make sure
> that the suggested pattern (i.e. user can always call abortTxn() when
> handling non-fatal errors) are indeed supported. E.g. if the txn is already
> aborted by the broker side, then users can still call abortTxn which would
> not throw another exception but just be treated as a no-op.
> 
> 1) *ConcurrentTransactionsException*: I think this error can also be
> returned but not documented yet. This should be a non-fatal error.
> 
> 2) *InvalidTxnStateException*: this error is returned from broker when txn
> state transition failed (e.g. it is trying to transit to complete-commit
> while the current state is not prepare-commit). This error could indicates
> a bug on the client internal code or the broker code, OR a user error --- a
> similar error is ConcurrentTransactionsException, i.e. if Kafka is bug-free
> these exceptions would only be returned if users try to do something wrong,
> e.g. calling abortTxn right after a commitTxn, etc. So I'm thinking it
> should be a non-fatal error instead of a fatal error, wdyt?
> 
> 3) *KafkaException*: case i "indicates fatal transactional sequence
> (Fatal)", this is a bit conflicting with the *OutOfSequenceException* that
> is treated as non-fatal. I guess your proposal is that
> OutOfOrderSequenceException would be treated either as fatal with
> transactional producer, or non-fatal with idempotent producer, is that
> right? If the producer is only configured with idempotency but not
> transaction, then throwing a TransactionStateCorruptedException for
> non-fatal errors would be confusing since users are not using transactions
> at all.. So I suggest we always throw OutOfSequenceException as-is (i.e.
> not wrapped) no matter how the producer is configured, and let the caller
> decide how to handle it based on whether it is only idempotent or
> transactional itself.
> 
> 4) Besides all the txn APIs, the `send()` callback / future can also throw
> txn-related exceptions, I think this KIP should also cover this API as well?
> 
> 5) This is related to 1/2) above: sometimes those non-fatal errors like
> ConcurrentTxn or InvalidTxnState are not due to the state being corrupted
> at the broker side, but maybe users are doing something wrong. So I'm
> wondering if we should further distinguish those non-fatal errors between
> a) those that are caused by Kafka itself, e.g. a broker timed out and
> aborted a txn and later an endTxn request is received, and b) the user's
> API call pattern is incorrect, causing the request to be rejected with an
> error code from the broker. *TransactionStateCorruptedException* feels to
> me more like for case a), but not case b).
> 
> 
> Guozhang
> 
> 
> On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <re...@gmail.com>
> wrote:
> 
>> Thanks Matthias, I think your proposal makes sense as well, on the pro side
>> we could have a universally agreed exception type to be caught by the user,
>> without having an extra layer on top of the actual exceptions. I could see
>> some issue on downsides:
>>
>> 1. The exception hierarchy will be more complex. Right now we have
>> RetriableException type, if we are going to add a
>> `ProducerRetriableException` type, we have to put this new interface as the
>> parent of the RetriableException, because not all thrown non-fatal
>> exceptions are `retriable` in general for producer, for example
>> <
>> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
>>> .
>> We could have an empty interface `ProducerRetriableException` to let all
>> the thrown exceptions implement for sure, but it's a bit unorthodox in the
>> way we deal with exceptions in general.
>>
>> 2. There are cases where we throw a KafkaException wrapping another
>> KafkaException as either fatal or non-fatal. If we use an interface to
>> solve #1, it is also required to implement another bloated exception class
>> which could replace KafkaException type, as we couldn't mark KafkaException
>> as retriable for sure.
>>
>> 3. In terms of the encapsulation, wrapping means we could limit the scope
>> of affection to the producer only, which is important since we don't want
>> shared exception types to implement a producer-related interface, such
>> as UnknownTopicOrPartitionException.
>>
>> Best,
>> Boyang
>>
>> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org> wrote:
>>
>>> Thanks for the KIP Boyang!
>>>
>>> Overall, categorizing exceptions makes a lot of sense. As I don't know
>>> the producer internals well enough, I cannot comment on the
>>> categorization in detail though.
>>>
>>> What I am wondering is, if we should introduce an exception interface
>>> that non-fatal exception would implement instead of creating a new class
>>> that will wrap non-fatal exceptions? What would be the pros/cons for
>>> both designs?
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 12/2/20 11:35 AM, Boyang Chen wrote:
>>>> Hey there,
>>>>
>>>> I would like to start a discussion thread for KIP-691:
>>>>
>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
>>>>
>>>> The KIP is aiming to simplify the exception handling logic for
>>>> transactional Producer users by classifying fatal and non-fatal
>>> exceptions
>>>> and throw them correspondingly for easier catch and retry. Let me know
>>> what
>>>> you think.
>>>>
>>>> Best,
>>>> Boyang
>>>>
>>>
>>
> 
> 

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Boyang for the proposal! I made a pass over the list and here are
some thoughts:

0) Although this is not part of the public API, I think we should make sure
that the suggested pattern (i.e. user can always call abortTxn() when
handling non-fatal errors) are indeed supported. E.g. if the txn is already
aborted by the broker side, then users can still call abortTxn which would
not throw another exception but just be treated as a no-op.

1) *ConcurrentTransactionsException*: I think this error can also be
returned but not documented yet. This should be a non-fatal error.

2) *InvalidTxnStateException*: this error is returned from broker when txn
state transition failed (e.g. it is trying to transit to complete-commit
while the current state is not prepare-commit). This error could indicates
a bug on the client internal code or the broker code, OR a user error --- a
similar error is ConcurrentTransactionsException, i.e. if Kafka is bug-free
these exceptions would only be returned if users try to do something wrong,
e.g. calling abortTxn right after a commitTxn, etc. So I'm thinking it
should be a non-fatal error instead of a fatal error, wdyt?

3) *KafkaException*: case i "indicates fatal transactional sequence
(Fatal)", this is a bit conflicting with the *OutOfSequenceException* that
is treated as non-fatal. I guess your proposal is that
OutOfOrderSequenceException would be treated either as fatal with
transactional producer, or non-fatal with idempotent producer, is that
right? If the producer is only configured with idempotency but not
transaction, then throwing a TransactionStateCorruptedException for
non-fatal errors would be confusing since users are not using transactions
at all.. So I suggest we always throw OutOfSequenceException as-is (i.e.
not wrapped) no matter how the producer is configured, and let the caller
decide how to handle it based on whether it is only idempotent or
transactional itself.

4) Besides all the txn APIs, the `send()` callback / future can also throw
txn-related exceptions, I think this KIP should also cover this API as well?

5) This is related to 1/2) above: sometimes those non-fatal errors like
ConcurrentTxn or InvalidTxnState are not due to the state being corrupted
at the broker side, but maybe users are doing something wrong. So I'm
wondering if we should further distinguish those non-fatal errors between
a) those that are caused by Kafka itself, e.g. a broker timed out and
aborted a txn and later an endTxn request is received, and b) the user's
API call pattern is incorrect, causing the request to be rejected with an
error code from the broker. *TransactionStateCorruptedException* feels to
me more like for case a), but not case b).


Guozhang


On Wed, Dec 2, 2020 at 4:50 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks Matthias, I think your proposal makes sense as well, on the pro side
> we could have a universally agreed exception type to be caught by the user,
> without having an extra layer on top of the actual exceptions. I could see
> some issue on downsides:
>
> 1. The exception hierarchy will be more complex. Right now we have
> RetriableException type, if we are going to add a
> `ProducerRetriableException` type, we have to put this new interface as the
> parent of the RetriableException, because not all thrown non-fatal
> exceptions are `retriable` in general for producer, for example
> <
> https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745
> >.
> We could have an empty interface `ProducerRetriableException` to let all
> the thrown exceptions implement for sure, but it's a bit unorthodox in the
> way we deal with exceptions in general.
>
> 2. There are cases where we throw a KafkaException wrapping another
> KafkaException as either fatal or non-fatal. If we use an interface to
> solve #1, it is also required to implement another bloated exception class
> which could replace KafkaException type, as we couldn't mark KafkaException
> as retriable for sure.
>
> 3. In terms of the encapsulation, wrapping means we could limit the scope
> of affection to the producer only, which is important since we don't want
> shared exception types to implement a producer-related interface, such
> as UnknownTopicOrPartitionException.
>
> Best,
> Boyang
>
> On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org> wrote:
>
> > Thanks for the KIP Boyang!
> >
> > Overall, categorizing exceptions makes a lot of sense. As I don't know
> > the producer internals well enough, I cannot comment on the
> > categorization in detail though.
> >
> > What I am wondering is, if we should introduce an exception interface
> > that non-fatal exception would implement instead of creating a new class
> > that will wrap non-fatal exceptions? What would be the pros/cons for
> > both designs?
> >
> >
> > -Matthias
> >
> >
> > On 12/2/20 11:35 AM, Boyang Chen wrote:
> > > Hey there,
> > >
> > > I would like to start a discussion thread for KIP-691:
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> > >
> > > The KIP is aiming to simplify the exception handling logic for
> > > transactional Producer users by classifying fatal and non-fatal
> > exceptions
> > > and throw them correspondingly for easier catch and retry. Let me know
> > what
> > > you think.
> > >
> > > Best,
> > > Boyang
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by Boyang Chen <re...@gmail.com>.
Thanks Matthias, I think your proposal makes sense as well, on the pro side
we could have a universally agreed exception type to be caught by the user,
without having an extra layer on top of the actual exceptions. I could see
some issue on downsides:

1. The exception hierarchy will be more complex. Right now we have
RetriableException type, if we are going to add a
`ProducerRetriableException` type, we have to put this new interface as the
parent of the RetriableException, because not all thrown non-fatal
exceptions are `retriable` in general for producer, for example
<https://github.com/apache/kafka/blob/e275742f850af4a1b79b0d1bd1ac9a1d2e89c64e/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L745>.
We could have an empty interface `ProducerRetriableException` to let all
the thrown exceptions implement for sure, but it's a bit unorthodox in the
way we deal with exceptions in general.

2. There are cases where we throw a KafkaException wrapping another
KafkaException as either fatal or non-fatal. If we use an interface to
solve #1, it is also required to implement another bloated exception class
which could replace KafkaException type, as we couldn't mark KafkaException
as retriable for sure.

3. In terms of the encapsulation, wrapping means we could limit the scope
of affection to the producer only, which is important since we don't want
shared exception types to implement a producer-related interface, such
as UnknownTopicOrPartitionException.

Best,
Boyang

On Wed, Dec 2, 2020 at 3:38 PM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks for the KIP Boyang!
>
> Overall, categorizing exceptions makes a lot of sense. As I don't know
> the producer internals well enough, I cannot comment on the
> categorization in detail though.
>
> What I am wondering is, if we should introduce an exception interface
> that non-fatal exception would implement instead of creating a new class
> that will wrap non-fatal exceptions? What would be the pros/cons for
> both designs?
>
>
> -Matthias
>
>
> On 12/2/20 11:35 AM, Boyang Chen wrote:
> > Hey there,
> >
> > I would like to start a discussion thread for KIP-691:
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> >
> > The KIP is aiming to simplify the exception handling logic for
> > transactional Producer users by classifying fatal and non-fatal
> exceptions
> > and throw them correspondingly for easier catch and retry. Let me know
> what
> > you think.
> >
> > Best,
> > Boyang
> >
>

Re: [DISCUSS] KIP-691: Transactional Producer Exception Handling

Posted by "Matthias J. Sax" <mj...@apache.org>.
Thanks for the KIP Boyang!

Overall, categorizing exceptions makes a lot of sense. As I don't know
the producer internals well enough, I cannot comment on the
categorization in detail though.

What I am wondering is, if we should introduce an exception interface
that non-fatal exception would implement instead of creating a new class
that will wrap non-fatal exceptions? What would be the pros/cons for
both designs?


-Matthias


On 12/2/20 11:35 AM, Boyang Chen wrote:
> Hey there,
> 
> I would like to start a discussion thread for KIP-691:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-691%3A+Enhance+Transactional+Producer+Exception+Handling
> 
> The KIP is aiming to simplify the exception handling logic for
> transactional Producer users by classifying fatal and non-fatal exceptions
> and throw them correspondingly for easier catch and retry. Let me know what
> you think.
> 
> Best,
> Boyang
>