You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Jason Gustafson <ja...@confluent.io> on 2020/08/26 18:32:50 UTC

[DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Hi All,

I've added a proposal to handle the problem of hanging transactions:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions.
In theory, this should never happen. In practice, we have hit one bug where
it was possible and there are few good options today to recover. Take a
look and let me know what you think.

Thanks,
Jason

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Boyang,

Thanks for the comments. Responses below:

> 1. For the analysis section, is there any consistency guarantee for
`ListTransactions` and `DescribeTransactions`? Let's say the coordinator
receives a DescribeTransactions while the transaction is almost complete at
the same time, should we have isolation to avoid returning stale
information? If an admin client sends excessive describe requests, would it
affect the normal processing on transactions?

It's similar to `ListGroups` and `DescribeGroup`. There is no guarantee on
the consistency between separate calls. The API is just reflecting the
current state. In general, tools need to be careful when trying to take
some action based on that state. This is one of the reasons I changed the
`WriteTxnMarkers` API to take the start offset of the transaction the user
wants to abort. In case we get into some race with the coordinator, at
least we won't end up aborting the wrong transaction.

> 2. I'm not sure whether `PartitionsWithLateTransactionsCount` is providing
much value here. Users have no responsibilities to tight their transaction
session size below the max transaction timeout. As long as there is
prolonged progress being made from the client side, the root cause would be
on some ill-performing producers, which means we should monitor the
producer client instead.

Not sure I follow. The max transaction timeout is enforced in the
`AddPartitionsToTxn` API. The client is not allowed a larger timeout, so it
seems fair for the broker to treat transactions which have been open longer
than this as "late." The benefit of this metric is that it provides a
simple alert criteria (i.e. alert for any positive value).

> 3. It would be good to highlight new fields in the
`WriteTxnMarkersRequest`
schema

Ack. Will do.

> 4. The response schema in `ListTransactions` section was wrong, which
should be `ListTransactionsResponse`

Thanks. I'll fix the name, but the schema looks like what I wanted.

>5. From the original context of ticket 9144, the reasoning for the hanging
transaction is due to an uncensored open transaction on the partition
leader. Could we just add the direct admin request support like
`findAllSuspiciousTransaction` to detect that, by scanning all partition
leaders and transaction coordinators within the cluster, and figure out any
open transaction on the partition leader side not known to any coordinator?

This is basically what the --find-hanging command is doing. I debated
moving this logic either into the admin client or the broker, but in the
end, I decided it was better to have good general purpose APIs and let the
complexity be in the tooling.

> 6. typo: "In this case, we the command..." => "In this case, the
command..."

Ack. Will fix.

Thanks,
Jason



On Thu, Aug 27, 2020 at 2:46 PM Boyang Chen <re...@gmail.com>
wrote:

> Thanks Jason for the tooling proposal. A couple of comments:
>
> 1. For the analysis section, is there any consistency guarantee for
> `ListTransactions` and `DescribeTransactions`? Let's say the coordinator
> receives a DescribeTransactions while the transaction is almost complete at
> the same time, should we have isolation to avoid returning stale
> information? If an admin client sends excessive describe requests, would it
> affect the normal processing on transactions?
>
> 2. I'm not sure whether `PartitionsWithLateTransactionsCount` is providing
> much value here. Users have no responsibilities to tight their transaction
> session size below the max transaction timeout. As long as there is
> prolonged progress being made from the client side, the root cause would be
> on some ill-performing producers, which means we should monitor the
> producer client instead.
>
> 3. It would be good to highlight new fields in the `WriteTxnMarkersRequest`
> schema
>
> 4. The response schema in `ListTransactions` section was wrong, which
> should be `ListTransactionsResponse`
>
> 5. From the original context of ticket 9144, the reasoning for the hanging
> transaction is due to an uncensored open transaction on the partition
> leader. Could we just add the direct admin request support like
> `findAllSuspiciousTransaction` to detect that, by scanning all partition
> leaders and transaction coordinators within the cluster, and figure out any
> open transaction on the partition leader side not known to any coordinator?
>
> 6. typo: "In this case, we the command..." => "In this case, the
> command..."
>
> Boyang
>
> On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
> wrote:
>
> > >> Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > > That's an interesting thought as well. Are you trying to avoid the need
> > to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Basically. I'm not sure how useful this will be in practice, though it
> > might help when debugging.
> >
> > Lucas
> >
> > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Lucas,
> > >
> > > Thanks for the comments. Responses below:
> > >
> > > > Given that it's possible for replica producer states to diverge from
> > each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers
> > >
> > > Yes, it makes sense to me to let DescribeProducers work on both
> followers
> > > and leaders. In fact, I'm encouraged that there are use cases for this
> > work
> > > other than detecting hanging transactions. That was indeed the hope,
> but
> > I
> > > didn't have anything specific in mind. I will update the proposal.
> > >
> > > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > That's an interesting thought as well. Are you trying to avoid the need
> > to
> > > specify it through the command line? The tool could also query the
> value
> > > with DescribeConfigs I suppose.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > This looks like a very useful tool, thanks for writing it up.
> > > >
> > > > Given that it's possible for replica producer states to diverge from
> > each
> > > > other, it would be very useful if DescribeProducers(Request,Response)
> > and
> > > > tooling is able to query all partition replicas for their producers.
> > One
> > > > way I can see this being used immediately is in kafka's system tests,
> > > > especially the ones that inject failures. At the end of the test we
> can
> > > > query all replicas and make sure that their states have not
> diverged. I
> > > can
> > > > also see it being useful when debugging production clusters too.
> > > >
> > > > Would it be worth returning transactional.id.expiration.ms in the
> > > > DescribeProducersResponse?
> > > >
> > > > Cheers,
> > > >
> > > > Lucas
> > > >
> > > >
> > > >
> > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com>
> > > wrote:
> > > >
> > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > >
> > > > > Ron
> > > > >
> > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hey Ron,
> > > > > >
> > > > > > We do not typically backport new APIs to older versions. I think
> we
> > > can
> > > > > > however make the --abort command compatible with older versions.
> It
> > > > would
> > > > > > require a user to do some analysis on their own to identify a
> > hanging
> > > > > > transaction, but then they can use the tool from a new release to
> > > > > recover.
> > > > > > For example, users could detect a hanging transaction through the
> > > > > existing
> > > > > > "LastStableOffsetLag" metric and then collect the needed
> > information
> > > > > from a
> > > > > > dump of the log (or producer snapshot). It's more work, but at
> > least
> > > > it's
> > > > > > possible. Does that sound fair?
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> rndgstn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > >
> > > > > > > Will the implementation be backported to prior Kafka versions?
> > The
> > > > > > reason
> > > > > > > I ask is because if it is not backported and similar
> > functionality
> > > is
> > > > > not
> > > > > > > otherwise made available for older versions, then the only
> > recourse
> > > > > > (aside
> > > > > > > from deleting and recreating the topic as you pointed out) may
> be
> > > to
> > > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > > functionality).
> > > > > > > Such an upgrade may not be desirable, especially if the number
> of
> > > > > > > intermediate versions is considerable. I understand the mantra
> of
> > > > > "never
> > > > > > > fall too many versions behind" but the reality of it is that it
> > > isn't
> > > > > > > always the case.  Even if the version is relatively recent, an
> > > > upgrade
> > > > > > may
> > > > > > > still not be possible for some time, and a quicker resolution
> may
> > > be
> > > > > > > necessary.
> > > > > > >
> > > > > > > Ron
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > jason@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > I've added a proposal to handle the problem of hanging
> > > > transactions:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > .
> > > > > > > > In theory, this should never happen. In practice, we have hit
> > one
> > > > bug
> > > > > > > where
> > > > > > > > it was possible and there are few good options today to
> > recover.
> > > > > Take a
> > > > > > > > look and let me know what you think.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Boyang Chen <re...@gmail.com>.
Thanks Jason for the tooling proposal. A couple of comments:

1. For the analysis section, is there any consistency guarantee for
`ListTransactions` and `DescribeTransactions`? Let's say the coordinator
receives a DescribeTransactions while the transaction is almost complete at
the same time, should we have isolation to avoid returning stale
information? If an admin client sends excessive describe requests, would it
affect the normal processing on transactions?

2. I'm not sure whether `PartitionsWithLateTransactionsCount` is providing
much value here. Users have no responsibilities to tight their transaction
session size below the max transaction timeout. As long as there is
prolonged progress being made from the client side, the root cause would be
on some ill-performing producers, which means we should monitor the
producer client instead.

3. It would be good to highlight new fields in the `WriteTxnMarkersRequest`
schema

4. The response schema in `ListTransactions` section was wrong, which
should be `ListTransactionsResponse`

5. From the original context of ticket 9144, the reasoning for the hanging
transaction is due to an uncensored open transaction on the partition
leader. Could we just add the direct admin request support like
`findAllSuspiciousTransaction` to detect that, by scanning all partition
leaders and transaction coordinators within the cluster, and figure out any
open transaction on the partition leader side not known to any coordinator?

6. typo: "In this case, we the command..." => "In this case, the
command..."

Boyang

On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
wrote:

> >> Would it be worth returning transactional.id.expiration.ms in the
> DescribeProducersResponse?
>
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> specify it through the command line? The tool could also query the value
> with DescribeConfigs I suppose.
>
> Basically. I'm not sure how useful this will be in practice, though it
> might help when debugging.
>
> Lucas
>
> On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Lucas,
> >
> > Thanks for the comments. Responses below:
> >
> > > Given that it's possible for replica producer states to diverge from
> each
> > other, it would be very useful if DescribeProducers(Request,Response) and
> > tooling is able to query all partition replicas for their producers
> >
> > Yes, it makes sense to me to let DescribeProducers work on both followers
> > and leaders. In fact, I'm encouraged that there are use cases for this
> work
> > other than detecting hanging transactions. That was indeed the hope, but
> I
> > didn't have anything specific in mind. I will update the proposal.
> >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > This looks like a very useful tool, thanks for writing it up.
> > >
> > > Given that it's possible for replica producer states to diverge from
> each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers.
> One
> > > way I can see this being used immediately is in kafka's system tests,
> > > especially the ones that inject failures. At the end of the test we can
> > > query all replicas and make sure that their states have not diverged. I
> > can
> > > also see it being useful when debugging production clusters too.
> > >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
> > >
> > >
> > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com>
> > wrote:
> > >
> > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > >
> > > > Ron
> > > >
> > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Ron,
> > > > >
> > > > > We do not typically backport new APIs to older versions. I think we
> > can
> > > > > however make the --abort command compatible with older versions. It
> > > would
> > > > > require a user to do some analysis on their own to identify a
> hanging
> > > > > transaction, but then they can use the tool from a new release to
> > > > recover.
> > > > > For example, users could detect a hanging transaction through the
> > > > existing
> > > > > "LastStableOffsetLag" metric and then collect the needed
> information
> > > > from a
> > > > > dump of the log (or producer snapshot). It's more work, but at
> least
> > > it's
> > > > > possible. Does that sound fair?
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > >
> > > > > > Will the implementation be backported to prior Kafka versions?
> The
> > > > > reason
> > > > > > I ask is because if it is not backported and similar
> functionality
> > is
> > > > not
> > > > > > otherwise made available for older versions, then the only
> recourse
> > > > > (aside
> > > > > > from deleting and recreating the topic as you pointed out) may be
> > to
> > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > functionality).
> > > > > > Such an upgrade may not be desirable, especially if the number of
> > > > > > intermediate versions is considerable. I understand the mantra of
> > > > "never
> > > > > > fall too many versions behind" but the reality of it is that it
> > isn't
> > > > > > always the case.  Even if the version is relatively recent, an
> > > upgrade
> > > > > may
> > > > > > still not be possible for some time, and a quicker resolution may
> > be
> > > > > > necessary.
> > > > > >
> > > > > > Ron
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > jason@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I've added a proposal to handle the problem of hanging
> > > transactions:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > .
> > > > > > > In theory, this should never happen. In practice, we have hit
> one
> > > bug
> > > > > > where
> > > > > > > it was possible and there are few good options today to
> recover.
> > > > Take a
> > > > > > > look and let me know what you think.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hi Bob,

Thanks for the comment.

> I'm not sure how much value the MaxActiveTransactionDuration metric adds,
given that we have the --find-hanging option in the tool. As you mention,
instances of these transactions are expected to be rare, and a
partition-level metric, which can generate a lot of data, seems very
heavyweight for such a rare occurrence. I think "alert on
PartitionsWithLateTransactionsCount" followed by "run kafka-transactions
--find-hanging on the relevant broker" is a reasonable process for cluster
operators to follow.

Yeah, I see your point. My original thinking was that this metric might
help an operator identify partitions specifically, but that was before I
added the --find-hanging command. I guess we can remove it.

-Jason




On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> Thanks for the detailed comments. Responses inline:
>
> > 1. I'd like to clarify how we can make "--abort" work with old brokers,
> since without the additional field "Partitions" the tool needs to set the
> coordinator epoch correctly instead of "-1"? Arguably that's still doable
> but would require different call paths, and it's not clear whether that's
> worth doing for old versions.
>
> That's a good question. What I had in mind was to write the marker using
> the last coordinator epoch that was used by the respective ProducerId. I
> realized that I left the coordinator epoch out of the `DescribeProducers`
> response, so I have updated the KIP to include it. It is possible that
> there is no coordinator epoch associated with a given ProducerId (e.g. if
> it is the first transaction from that producer), but in this case we can
> use 0.
>
> As for whether this is worth doing, I guess I would be more inclined to
> leave it out if users had a reasonable alternative today to address this
> problem.
>
> > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> leaders
> while ListTransactions can be sent to any brokers? Or is it really
> "ListTransactions to be sent to coordinators only"? From the workflow
> you've described, based on the results back from DescribeProducers, we
> should just immediately send ListTransactions to the
> corresponding coordinators based on the collected producer ids, instead of
> trying to send to any brokers right?
>
> I'm going to change `DescribeProducers` so that it can be handled by any
> replica of a topic partition. This was suggested by Lucas in order to allow
> this API to be used for replica consistency testing. As far as
> `ListTransactions`, I was treating this similarly to `ListGroups`. Although
> we know that the coordinators are the leaders of the __transaction_state
> partitions, this is more of an implementation detail. From an API
> perspective, we say that any broker could be a transaction coordinator.
>
> > 3. One thing I'm a bit hesitant about is that, is `Describe` permission
> on
> the associated topic sufficient to allow any users to get all producer
> information writing to the specific topic-partitions including last
> timestamp, txn-start-timestamp etc, which may be considered sensitive?
> Should we require "ClusterAction" to only allow operators only?
>
> That's a fair point. Do you think `Read` permission would be reasonable?
> This is all information that could be obtained by reading the topic.
>
> > 4. From the example it seems "TxnStartOffset" should be included in the
> DescribeTransaction response schema? Otherwise the user would not get it in
> the following WriteTxnMarker request.
>
> The `DescribeTransaction` API is sent to the transaction coordinator,
> which does not know the start offset of a transaction in each topic
> partition. That is why we need `DescribeProducers`.
>
> > 5. It is a bit easier for readers to highlight the added fields in the
> existing WriteTxnMarkerRequest (btw I read is that we are only adding
> "Partitions" with the starting offset, right?). Also as for its response it
> seems we do not make any schema changes except adding one more potential
> error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
> state that explicitly.
>
> I highlighted the new field in the request. For the response, the KIP
> states the following: "There are no changes to the response schema, but it
> will be bumped. Note that we are also enabling flexible version support."
>
> > 6. It is not clear to me for the overloaded function that the following
> option classes are not specified, what should be the default options?
> ...
>
> I was just trying to stick with existing conventions, but I will add some
> more detail here. I think we should probably still include
> `AbortTransactionOptions`. The `Options` classes are how users override
> timeouts.
>
> > 7.1 Is "--broker" a required or optional (in that case I presume we would
> just query all brokers iteratively) in "--find-hanging"?
>
> I think it should be required as a reasonable way to limit the scope of
> the search. This is meant to be guided by metrics after all. If we do not
> limit the scope to a single broker, then the behavior might get worse as
> the cluster grows. I will clarify this.
>
> > 7.2 Seems "list-producers" is not exposed as a standalone feature in the
> cmd but only used in the wrapping "--find-hanging", is that intentional?
> Personally I feel exposing a "--list-producers" may be useful too: if we
> believe the user has the right ACL, it is legitimate to return the producer
> information to her anyways. But that is debatable in the meta point 3)
> above.
>
> Yeah, I was planning to add this to support the use case that Lucas
> mentioned. There is some awkwardness since it is a little difficult to
> convey different sources of information through the same command. I guess
> we can do `--list producers` and `--list transactions` and explain in the
> documentation. Maybe that is good enough.
>
> > 7.3 "Describing Transactions": we should also explain how that would be
> executed, e.g. at least we should clarify that we would first find the
> coordinator based on the transactional.id and hence users do not need to
> specify one.
>
> Sure, makes sense.
>
> > 7.4. In "Aborting Transactions", should we also specify the "--broker"
> node
> as a required option? Otherwise we would not know which broker to send to.
>
> The --topic and --partition arguments are required, so the target is
> always the leader of that partition.
>
>
> Thanks,
> Jason
>
>
>
> On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <bo...@confluent.io>
> wrote:
>
>> Hi Jason,
>>
>> Thanks for this KIP, I think this will be a huge operational improvement
>> and overall it looks great to me.
>>
>> I'm not sure how much value the MaxActiveTransactionDuration metric adds,
>> given that we have the --find-hanging option in the tool. As you mention,
>> instances of these transactions are expected to be rare, and a
>> partition-level metric, which can generate a lot of data, seems very
>> heavyweight for such a rare occurrence. I think "alert on
>> PartitionsWithLateTransactionsCount" followed by "run kafka-transactions
>> --find-hanging on the relevant broker" is a reasonable process for cluster
>> operators to follow.
>>
>> Thanks,
>> Bob
>>
>> On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com> wrote:
>>
>> > Hi Jason,
>> >
>> > Thanks for the written KIP. I think this is going to be a very useful
>> tool
>> > for operational improvements since with eos in its current stage, we
>> cannot
>> > confidently assert that we are bug-free, and even in the future when we
>> are
>> > confident this is still going to be leveraged by older versioned
>> brokers.
>> > Regarding the solution, I've also debated myself whether Kafka should
>> > "self-heal" automatically when detected in such situations, or should we
>> > instead build into ecosystem tooling to let operators do it. And I've
>> also
>> > convinced myself that the latter should be a better solution to keep
>> Kafka
>> > software itself simpler.
>> >
>> > Regarding the KIP itself, I have a few meta comments below:
>> >
>> > 1. I'd like to clarify how we can make "--abort" work with old brokers,
>> > since without the additional field "Partitions" the tool needs to set
>> the
>> > coordinator epoch correctly instead of "-1"? Arguably that's still
>> doable
>> > but would require different call paths, and it's not clear whether
>> that's
>> > worth doing for old versions.
>> >
>> > 2. Why do we have to enforce "DescribeProducers" to be sent to only
>> leaders
>> > while ListTransactions can be sent to any brokers? Or is it really
>> > "ListTransactions to be sent to coordinators only"? From the workflow
>> > you've described, based on the results back from DescribeProducers, we
>> > should just immediately send ListTransactions to the
>> > corresponding coordinators based on the collected producer ids, instead
>> of
>> > trying to send to any brokers right?
>> >
>> > Also I'm a bit concerned if "ListTransactions" could potentially return
>> too
>> > much data with "StateFilters" set to all states, including completed
>> ones.
>> > Do we expect users ever want to know transactions that are not pending?
>> On
>> > the other hand, maybe we can just require users to specify the "pids[]"
>> in
>> > this request too to further filter those un-interested transactions.
>> This
>> > also works well with the workflow: we know exactly from
>> "DescribeProducers"
>> > which pids are we diagnosing right now, so in the follow-up
>> > "ListTransactions" we should also only care for those partitions only.
>> >
>> > 3. One thing I'm a bit hesitant about is that, is `Describe` permission
>> on
>> > the associated topic sufficient to allow any users to get all producer
>> > information writing to the specific topic-partitions including last
>> > timestamp, txn-start-timestamp etc, which may be considered sensitive?
>> > Should we require "ClusterAction" to only allow operators only?
>> >
>> > Below are more detailed comments:
>> >
>> > 4. From the example it seems "TxnStartOffset" should be included in the
>> > DescribeTransaction response schema? Otherwise the user would not get
>> it in
>> > the following WriteTxnMarker request.
>> >
>> > 5. It is a bit easier for readers to highlight the added fields in the
>> > existing WriteTxnMarkerRequest (btw I read is that we are only adding
>> > "Partitions" with the starting offset, right?). Also as for its
>> response it
>> > seems we do not make any schema changes except adding one more potential
>> > error code "INVALID_TXN_STATE" to it, right? If that's the case we can
>> just
>> > state that explicitly.
>> >
>> > 6. It is not clear to me for the overloaded function that the following
>> > option classes are not specified, what should be the default options?
>> >
>> > * ListTransactionsOptions
>> > * DescribeTransactionsOptions
>> > * DescribeProducersOptions
>> >
>> > Also, it seems AbortTransactionOptions would just be empty? If yes do we
>> > really need this option class for now?
>> >
>> > 7. A couple questions from the cmd tool examples:
>> > 7.1 Is "--broker" a required or optional (in that case I presume we
>> would
>> > just query all brokers iteratively) in "--find-hanging"?
>> > 7.2 Seems "list-producers" is not exposed as a standalone feature in the
>> > cmd but only used in the wrapping "--find-hanging", is that intentional?
>> > Personally I feel exposing a "--list-producers" may be useful too: if we
>> > believe the user has the right ACL, it is legitimate to return the
>> producer
>> > information to her anyways. But that is debatable in the meta point 3)
>> > above.
>> > 7.3 "Describing Transactions": we should also explain how that would be
>> > executed, e.g. at least we should clarify that we would first find the
>> > coordinator based on the transactional.id and hence users do not need
>> to
>> > specify one.
>> > 7.4. In "Aborting Transactions", should we also specify the "--broker"
>> node
>> > as a required option? Otherwise we would not know which broker to send
>> to.
>> >
>> >
>> > Overall, nice written one, thanks Jason.
>> >
>> > Guozhang
>> >
>> >
>> > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
>> > wrote:
>> >
>> > > >> Would it be worth returning transactional.id.expiration.ms in the
>> > > DescribeProducersResponse?
>> > >
>> > > > That's an interesting thought as well. Are you trying to avoid the
>> need
>> > > to
>> > > specify it through the command line? The tool could also query the
>> value
>> > > with DescribeConfigs I suppose.
>> > >
>> > > Basically. I'm not sure how useful this will be in practice, though it
>> > > might help when debugging.
>> > >
>> > > Lucas
>> > >
>> > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
>> > > wrote:
>> > >
>> > > > Hey Lucas,
>> > > >
>> > > > Thanks for the comments. Responses below:
>> > > >
>> > > > > Given that it's possible for replica producer states to diverge
>> from
>> > > each
>> > > > other, it would be very useful if
>> DescribeProducers(Request,Response)
>> > and
>> > > > tooling is able to query all partition replicas for their producers
>> > > >
>> > > > Yes, it makes sense to me to let DescribeProducers work on both
>> > followers
>> > > > and leaders. In fact, I'm encouraged that there are use cases for
>> this
>> > > work
>> > > > other than detecting hanging transactions. That was indeed the hope,
>> > but
>> > > I
>> > > > didn't have anything specific in mind. I will update the proposal.
>> > > >
>> > > > > Would it be worth returning transactional.id.expiration.ms in the
>> > > > DescribeProducersResponse?
>> > > >
>> > > > That's an interesting thought as well. Are you trying to avoid the
>> need
>> > > to
>> > > > specify it through the command line? The tool could also query the
>> > value
>> > > > with DescribeConfigs I suppose.
>> > > >
>> > > > Thanks,
>> > > > Jason
>> > > >
>> > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
>> lucas@confluent.io>
>> > > > wrote:
>> > > >
>> > > > > Hi Jason,
>> > > > >
>> > > > > This looks like a very useful tool, thanks for writing it up.
>> > > > >
>> > > > > Given that it's possible for replica producer states to diverge
>> from
>> > > each
>> > > > > other, it would be very useful if
>> DescribeProducers(Request,Response)
>> > > and
>> > > > > tooling is able to query all partition replicas for their
>> producers.
>> > > One
>> > > > > way I can see this being used immediately is in kafka's system
>> tests,
>> > > > > especially the ones that inject failures. At the end of the test
>> we
>> > can
>> > > > > query all replicas and make sure that their states have not
>> > diverged. I
>> > > > can
>> > > > > also see it being useful when debugging production clusters too.
>> > > > >
>> > > > > Would it be worth returning transactional.id.expiration.ms in the
>> > > > > DescribeProducersResponse?
>> > > > >
>> > > > > Cheers,
>> > > > >
>> > > > > Lucas
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rndgstn@gmail.com
>> >
>> > > > wrote:
>> > > > >
>> > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
>> > > > > >
>> > > > > > Ron
>> > > > > >
>> > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
>> > jason@confluent.io>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hey Ron,
>> > > > > > >
>> > > > > > > We do not typically backport new APIs to older versions. I
>> think
>> > we
>> > > > can
>> > > > > > > however make the --abort command compatible with older
>> versions.
>> > It
>> > > > > would
>> > > > > > > require a user to do some analysis on their own to identify a
>> > > hanging
>> > > > > > > transaction, but then they can use the tool from a new
>> release to
>> > > > > > recover.
>> > > > > > > For example, users could detect a hanging transaction through
>> the
>> > > > > > existing
>> > > > > > > "LastStableOffsetLag" metric and then collect the needed
>> > > information
>> > > > > > from a
>> > > > > > > dump of the log (or producer snapshot). It's more work, but at
>> > > least
>> > > > > it's
>> > > > > > > possible. Does that sound fair?
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > Jason
>> > > > > > >
>> > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
>> > rndgstn@gmail.com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
>> > > > > > > >
>> > > > > > > > Will the implementation be backported to prior Kafka
>> versions?
>> > > The
>> > > > > > > reason
>> > > > > > > > I ask is because if it is not backported and similar
>> > > functionality
>> > > > is
>> > > > > > not
>> > > > > > > > otherwise made available for older versions, then the only
>> > > recourse
>> > > > > > > (aside
>> > > > > > > > from deleting and recreating the topic as you pointed out)
>> may
>> > be
>> > > > to
>> > > > > > > > upgrade to 2.7 (or whatever version ends up getting this
>> > > > > > functionality).
>> > > > > > > > Such an upgrade may not be desirable, especially if the
>> number
>> > of
>> > > > > > > > intermediate versions is considerable. I understand the
>> mantra
>> > of
>> > > > > > "never
>> > > > > > > > fall too many versions behind" but the reality of it is
>> that it
>> > > > isn't
>> > > > > > > > always the case.  Even if the version is relatively recent,
>> an
>> > > > > upgrade
>> > > > > > > may
>> > > > > > > > still not be possible for some time, and a quicker
>> resolution
>> > may
>> > > > be
>> > > > > > > > necessary.
>> > > > > > > >
>> > > > > > > > Ron
>> > > > > > > >
>> > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
>> > > > jason@confluent.io>
>> > > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi All,
>> > > > > > > > >
>> > > > > > > > > I've added a proposal to handle the problem of hanging
>> > > > > transactions:
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
>> > > > > > > > > .
>> > > > > > > > > In theory, this should never happen. In practice, we have
>> hit
>> > > one
>> > > > > bug
>> > > > > > > > where
>> > > > > > > > > it was possible and there are few good options today to
>> > > recover.
>> > > > > > Take a
>> > > > > > > > > look and let me know what you think.
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Jason
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>> >
>> > --
>> > -- Guozhang
>> >
>>
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Boyang,

Thanks for the comments. Responses below:

> 1. Why do we need to use type string for `StatesFilter` instead of a short
value, as we could translate it and save space?

I went back and forth on this. In the end I used a string for consistency
with `DescribeGroups`. I doubt space is much of an issue because there are
not that many states and these requests should be infrequent.

> 2. I'm wondering whether the requirement for Describe permission on
TransactionalId works when we are heading towards
https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on
consumer group id instead of defining the transactional id. At a first
look, I think it should be ok but just want to raise this point.

Yeah, I don't think we're making anything worse in any case. It would be
easier to say if we had a concrete proposal for KAFKA-9454.

> 3. Could the --find-hanging work with checking all brokers in the cluster,
or multiple brokers as a list?

Someone else asked about this above. It could do that, but I think it's
important to limit the scope of the search so that it's behavior is
reasonably bounded without respect to the cluster size. The tool is
intended to be guided by the new metrics. Perhaps we can reconsider this if
users ask for it?

> 4. Similar to transaction abortion, I guess there is a trade-off for
too-specific vs too-general for the required number of arguments. However,
supposedly I would like to wipe out all the associated transactions with
the given transactional id, or I want to clean up *all *hanging
transactions in the cluster, do I need to write the script on my own? Maybe
we could discuss a bit on whether we would like to support a more holistic
API, or this is good for now.

There are definitely further improvements that are possible. My main goal
was to get a basic API in place with some reasonable safeguards. Although I
did consider more focused APIs to detect hanging transactions directly, I
decided it was better to keep the APIs simple and build the complexity into
the tool. The nice thing is that this gives us all the building blocks to
extend the tool in the future. I hope that won't be needed of course. We
know of one case so far that can cause hanging transactions. If we start
seeing more, it's probably a sign that we are doing a poor job testing and
verifying our implementation and that our time had better be spent
improving that.


Thanks,
Jason



On Thu, Sep 17, 2020 at 11:12 AM Boyang Chen <re...@gmail.com>
wrote:

> Thanks for the updates Jason. I'm pretty satisfied with the overall
> motivation and proposed solution, just a couple of more comments.
>
> 1. Why do we need to use type string for `StatesFilter` instead of a short
> value, as we could translate it and save space?
>
> 2. I'm wondering whether the requirement for Describe permission on
> TransactionalId works when we are heading towards
> https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on
> consumer group id instead of defining the transactional id. At a first
> look, I think it should be ok but just want to raise this point.
>
> 3. Could the --find-hanging work with checking all brokers in the cluster,
> or multiple brokers as a list?
>
> 4. Similar to transaction abortion, I guess there is a trade-off for
> too-specific vs too-general for the required number of arguments. However,
> supposedly I would like to wipe out all the associated transactions with
> the given transactional id, or I want to clean up *all *hanging
> transactions in the cluster, do I need to write the script on my own? Maybe
> we could discuss a bit on whether we would like to support a more holistic
> API, or this is good for now.
>
>
> On Thu, Sep 10, 2020 at 7:53 AM Tom Bentley <tb...@redhat.com> wrote:
>
> > Sounds good to me, thanks!
> >
> > On Wed, Sep 9, 2020 at 5:30 PM Jason Gustafson <ja...@confluent.io>
> wrote:
> >
> > > Hey Tom,
> > >
> > > Yeah, that's fair. I will update the proposal. I was also thinking of
> > > adding a separate column for duration, just to save users the trouble
> of
> > > computing it.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tb...@redhat.com>
> wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > The KIP looks good to me, but I had one question. AFAIU the
> > LastTimestamp
> > > > column in the output of --describe-producers and --find-hanging is
> > there
> > > so
> > > > the users of the tool know the txnLastUpdateTimestamp of the
> > > > TransactionMetadata and from that and the (max) timeout can infer
> > > something
> > > > about the likelihood that this really is a stuck transaction. If
> that's
> > > the
> > > > case then what is the benefit in displaying it as a ms offset from
> the
> > > unix
> > > > epoch, rather than an actual date time?
> > > >
> > > > Thanks,
> > > >
> > > > Tom
> > > >
> > > > On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks Jason, I do not have more comments on the KIP then.
> > > > >
> > > > > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > > DescribeProducers
> > > > > > response either?
> > > > > >
> > > > > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the
> > > schema.
> > > > > > Fixed now!
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <
> > > jason@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Guozhang,
> > > > > > > >
> > > > > > > > Thanks for the detailed comments. Responses inline:
> > > > > > > >
> > > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> > old
> > > > > > brokers,
> > > > > > > > since without the additional field "Partitions" the tool
> needs
> > to
> > > > set
> > > > > > the
> > > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> > > still
> > > > > > doable
> > > > > > > > but would require different call paths, and it's not clear
> > > whether
> > > > > > that's
> > > > > > > > worth doing for old versions.
> > > > > > > >
> > > > > > > > That's a good question. What I had in mind was to write the
> > > marker
> > > > > > using
> > > > > > > > the last coordinator epoch that was used by the respective
> > > > > ProducerId.
> > > > > > I
> > > > > > > > realized that I left the coordinator epoch out of the
> > > > > > `DescribeProducers`
> > > > > > > > response, so I have updated the KIP to include it. It is
> > possible
> > > > > that
> > > > > > > > there is no coordinator epoch associated with a given
> > ProducerId
> > > > > (e.g.
> > > > > > if
> > > > > > > > it is the first transaction from that producer), but in this
> > case
> > > > we
> > > > > > can
> > > > > > > > use 0.
> > > > > > > >
> > > > > > > > As for whether this is worth doing, I guess I would be more
> > > > inclined
> > > > > to
> > > > > > > > leave it out if users had a reasonable alternative today to
> > > address
> > > > > > this
> > > > > > > > problem.
> > > > > > > >
> > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> > to
> > > > only
> > > > > > > > leaders
> > > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > > really
> > > > > > > > "ListTransactions to be sent to coordinators only"? From the
> > > > workflow
> > > > > > > > you've described, based on the results back from
> > > DescribeProducers,
> > > > > we
> > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > corresponding coordinators based on the collected producer
> ids,
> > > > > instead
> > > > > > > of
> > > > > > > > trying to send to any brokers right?
> > > > > > > >
> > > > > > > > I'm going to change `DescribeProducers` so that it can be
> > handled
> > > > by
> > > > > > any
> > > > > > > > replica of a topic partition. This was suggested by Lucas in
> > > order
> > > > to
> > > > > > > allow
> > > > > > > > this API to be used for replica consistency testing. As far
> as
> > > > > > > > `ListTransactions`, I was treating this similarly to
> > > `ListGroups`.
> > > > > > > Although
> > > > > > > > we know that the coordinators are the leaders of the
> > > > > > __transaction_state
> > > > > > > > partitions, this is more of an implementation detail. From an
> > API
> > > > > > > > perspective, we say that any broker could be a transaction
> > > > > coordinator.
> > > > > > > >
> > > > > > > > > 3. One thing I'm a bit hesitant about is that, is
> `Describe`
> > > > > > permission
> > > > > > > > on
> > > > > > > > the associated topic sufficient to allow any users to get all
> > > > > producer
> > > > > > > > information writing to the specific topic-partitions
> including
> > > last
> > > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > > sensitive?
> > > > > > > > Should we require "ClusterAction" to only allow operators
> only?
> > > > > > > >
> > > > > > > > That's a fair point. Do you think `Read` permission would be
> > > > > > reasonable?
> > > > > > > > This is all information that could be obtained by reading the
> > > > topic.
> > > > > > > >
> > > > > > > > Yeah that makes sense.
> > > > > > >
> > > > > > >
> > > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > > included
> > > > in
> > > > > > the
> > > > > > > > DescribeTransaction response schema? Otherwise the user would
> > not
> > > > get
> > > > > > it
> > > > > > > in
> > > > > > > > the following WriteTxnMarker request.
> > > > > > > >
> > > > > > > > The `DescribeTransaction` API is sent to the transaction
> > > > coordinator,
> > > > > > > which
> > > > > > > > does not know the start offset of a transaction in each topic
> > > > > > partition.
> > > > > > > > That is why we need `DescribeProducers`.
> > > > > > > >
> > > > > > >
> > > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > > DescribeProducers
> > > > > > > response either?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > 5. It is a bit easier for readers to highlight the added
> > fields
> > > > in
> > > > > > the
> > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> only
> > > > adding
> > > > > > > > "Partitions" with the starting offset, right?). Also as for
> its
> > > > > > response
> > > > > > > it
> > > > > > > > seems we do not make any schema changes except adding one
> more
> > > > > > potential
> > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the
> case
> > > we
> > > > > can
> > > > > > > just
> > > > > > > > state that explicitly.
> > > > > > > >
> > > > > > > > I highlighted the new field in the request. For the response,
> > the
> > > > KIP
> > > > > > > > states the following: "There are no changes to the response
> > > schema,
> > > > > but
> > > > > > > it
> > > > > > > > will be bumped. Note that we are also enabling flexible
> version
> > > > > > support."
> > > > > > > >
> > > > > > > > > 6. It is not clear to me for the overloaded function that
> the
> > > > > > following
> > > > > > > > option classes are not specified, what should be the default
> > > > options?
> > > > > > > > ...
> > > > > > > >
> > > > > > > > I was just trying to stick with existing conventions, but I
> > will
> > > > add
> > > > > > some
> > > > > > > > more detail here. I think we should probably still include
> > > > > > > > `AbortTransactionOptions`. The `Options` classes are how
> users
> > > > > override
> > > > > > > > timeouts.
> > > > > > > >
> > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > > presume
> > > > we
> > > > > > > would
> > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > >
> > > > > > > > I think it should be required as a reasonable way to limit
> the
> > > > scope
> > > > > of
> > > > > > > the
> > > > > > > > search. This is meant to be guided by metrics after all. If
> we
> > do
> > > > not
> > > > > > > limit
> > > > > > > > the scope to a single broker, then the behavior might get
> worse
> > > as
> > > > > the
> > > > > > > > cluster grows. I will clarify this.
> > > > > > > >
> > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > > feature
> > > > > in
> > > > > > > the
> > > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > > intentional?
> > > > > > > > Personally I feel exposing a "--list-producers" may be useful
> > > too:
> > > > if
> > > > > > we
> > > > > > > > believe the user has the right ACL, it is legitimate to
> return
> > > the
> > > > > > > producer
> > > > > > > > information to her anyways. But that is debatable in the meta
> > > point
> > > > > 3)
> > > > > > > > above.
> > > > > > > >
> > > > > > > > Yeah, I was planning to add this to support the use case that
> > > Lucas
> > > > > > > > mentioned. There is some awkwardness since it is a little
> > > difficult
> > > > > to
> > > > > > > > convey different sources of information through the same
> > > command. I
> > > > > > guess
> > > > > > > > we can do `--list producers` and `--list transactions` and
> > > explain
> > > > in
> > > > > > the
> > > > > > > > documentation. Maybe that is good enough.
> > > > > > > >
> > > > > > > > > 7.3 "Describing Transactions": we should also explain how
> > that
> > > > > would
> > > > > > be
> > > > > > > > executed, e.g. at least we should clarify that we would first
> > > find
> > > > > the
> > > > > > > > coordinator based on the transactional.id and hence users do
> > not
> > > > > need
> > > > > > to
> > > > > > > > specify one.
> > > > > > > >
> > > > > > > > Sure, makes sense.
> > > > > > > >
> > > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > > "--broker"
> > > > > > > > node
> > > > > > > > as a required option? Otherwise we would not know which
> broker
> > to
> > > > > send
> > > > > > > to.
> > > > > > > >
> > > > > > > > The --topic and --partition arguments are required, so the
> > target
> > > > is
> > > > > > > always
> > > > > > > > the leader of that partition.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > > > > > bob.barrett@confluent.io
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Thanks for this KIP, I think this will be a huge
> operational
> > > > > > > improvement
> > > > > > > > > and overall it looks great to me.
> > > > > > > > >
> > > > > > > > > I'm not sure how much value the
> MaxActiveTransactionDuration
> > > > metric
> > > > > > > adds,
> > > > > > > > > given that we have the --find-hanging option in the tool.
> As
> > > you
> > > > > > > mention,
> > > > > > > > > instances of these transactions are expected to be rare,
> and
> > a
> > > > > > > > > partition-level metric, which can generate a lot of data,
> > seems
> > > > > very
> > > > > > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > > > > > kafka-transactions
> > > > > > > > > --find-hanging on the relevant broker" is a reasonable
> > process
> > > > for
> > > > > > > > cluster
> > > > > > > > > operators to follow.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Bob
> > > > > > > > >
> > > > > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <
> > > > wangguoz@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > Thanks for the written KIP. I think this is going to be a
> > > very
> > > > > > useful
> > > > > > > > > tool
> > > > > > > > > > for operational improvements since with eos in its
> current
> > > > stage,
> > > > > > we
> > > > > > > > > cannot
> > > > > > > > > > confidently assert that we are bug-free, and even in the
> > > future
> > > > > > when
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > > confident this is still going to be leveraged by older
> > > > versioned
> > > > > > > > brokers.
> > > > > > > > > > Regarding the solution, I've also debated myself whether
> > > Kafka
> > > > > > should
> > > > > > > > > > "self-heal" automatically when detected in such
> situations,
> > > or
> > > > > > should
> > > > > > > > we
> > > > > > > > > > instead build into ecosystem tooling to let operators do
> > it.
> > > > And
> > > > > > I've
> > > > > > > > > also
> > > > > > > > > > convinced myself that the latter should be a better
> > solution
> > > to
> > > > > > keep
> > > > > > > > > Kafka
> > > > > > > > > > software itself simpler.
> > > > > > > > > >
> > > > > > > > > > Regarding the KIP itself, I have a few meta comments
> below:
> > > > > > > > > >
> > > > > > > > > > 1. I'd like to clarify how we can make "--abort" work
> with
> > > old
> > > > > > > brokers,
> > > > > > > > > > since without the additional field "Partitions" the tool
> > > needs
> > > > to
> > > > > > set
> > > > > > > > the
> > > > > > > > > > coordinator epoch correctly instead of "-1"? Arguably
> > that's
> > > > > still
> > > > > > > > doable
> > > > > > > > > > but would require different call paths, and it's not
> clear
> > > > > whether
> > > > > > > > that's
> > > > > > > > > > worth doing for old versions.
> > > > > > > > > >
> > > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be
> sent
> > > to
> > > > > only
> > > > > > > > > leaders
> > > > > > > > > > while ListTransactions can be sent to any brokers? Or is
> it
> > > > > really
> > > > > > > > > > "ListTransactions to be sent to coordinators only"? From
> > the
> > > > > > workflow
> > > > > > > > > > you've described, based on the results back from
> > > > > DescribeProducers,
> > > > > > > we
> > > > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > > > corresponding coordinators based on the collected
> producer
> > > ids,
> > > > > > > instead
> > > > > > > > > of
> > > > > > > > > > trying to send to any brokers right?
> > > > > > > > > >
> > > > > > > > > > Also I'm a bit concerned if "ListTransactions" could
> > > > potentially
> > > > > > > return
> > > > > > > > > too
> > > > > > > > > > much data with "StateFilters" set to all states,
> including
> > > > > > completed
> > > > > > > > > ones.
> > > > > > > > > > Do we expect users ever want to know transactions that
> are
> > > not
> > > > > > > pending?
> > > > > > > > > On
> > > > > > > > > > the other hand, maybe we can just require users to
> specify
> > > the
> > > > > > > "pids[]"
> > > > > > > > > in
> > > > > > > > > > this request too to further filter those un-interested
> > > > > > transactions.
> > > > > > > > This
> > > > > > > > > > also works well with the workflow: we know exactly from
> > > > > > > > > "DescribeProducers"
> > > > > > > > > > which pids are we diagnosing right now, so in the
> follow-up
> > > > > > > > > > "ListTransactions" we should also only care for those
> > > > partitions
> > > > > > > only.
> > > > > > > > > >
> > > > > > > > > > 3. One thing I'm a bit hesitant about is that, is
> > `Describe`
> > > > > > > permission
> > > > > > > > > on
> > > > > > > > > > the associated topic sufficient to allow any users to get
> > all
> > > > > > > producer
> > > > > > > > > > information writing to the specific topic-partitions
> > > including
> > > > > last
> > > > > > > > > > timestamp, txn-start-timestamp etc, which may be
> considered
> > > > > > > sensitive?
> > > > > > > > > > Should we require "ClusterAction" to only allow operators
> > > only?
> > > > > > > > > >
> > > > > > > > > > Below are more detailed comments:
> > > > > > > > > >
> > > > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > > > included
> > > > > in
> > > > > > > the
> > > > > > > > > > DescribeTransaction response schema? Otherwise the user
> > would
> > > > not
> > > > > > get
> > > > > > > > it
> > > > > > > > > in
> > > > > > > > > > the following WriteTxnMarker request.
> > > > > > > > > >
> > > > > > > > > > 5. It is a bit easier for readers to highlight the added
> > > fields
> > > > > in
> > > > > > > the
> > > > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> > > only
> > > > > > adding
> > > > > > > > > > "Partitions" with the starting offset, right?). Also as
> for
> > > its
> > > > > > > > response
> > > > > > > > > it
> > > > > > > > > > seems we do not make any schema changes except adding one
> > > more
> > > > > > > > potential
> > > > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's
> the
> > > case
> > > > > we
> > > > > > > can
> > > > > > > > > just
> > > > > > > > > > state that explicitly.
> > > > > > > > > >
> > > > > > > > > > 6. It is not clear to me for the overloaded function that
> > the
> > > > > > > following
> > > > > > > > > > option classes are not specified, what should be the
> > default
> > > > > > options?
> > > > > > > > > >
> > > > > > > > > > * ListTransactionsOptions
> > > > > > > > > > * DescribeTransactionsOptions
> > > > > > > > > > * DescribeProducersOptions
> > > > > > > > > >
> > > > > > > > > > Also, it seems AbortTransactionOptions would just be
> empty?
> > > If
> > > > > yes
> > > > > > do
> > > > > > > > we
> > > > > > > > > > really need this option class for now?
> > > > > > > > > >
> > > > > > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > > > presume
> > > > > we
> > > > > > > > would
> > > > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > > > feature
> > > > > > in
> > > > > > > > the
> > > > > > > > > > cmd but only used in the wrapping "--find-hanging", is
> that
> > > > > > > > intentional?
> > > > > > > > > > Personally I feel exposing a "--list-producers" may be
> > useful
> > > > > too:
> > > > > > if
> > > > > > > > we
> > > > > > > > > > believe the user has the right ACL, it is legitimate to
> > > return
> > > > > the
> > > > > > > > > producer
> > > > > > > > > > information to her anyways. But that is debatable in the
> > meta
> > > > > point
> > > > > > > 3)
> > > > > > > > > > above.
> > > > > > > > > > 7.3 "Describing Transactions": we should also explain how
> > > that
> > > > > > would
> > > > > > > be
> > > > > > > > > > executed, e.g. at least we should clarify that we would
> > first
> > > > > find
> > > > > > > the
> > > > > > > > > > coordinator based on the transactional.id and hence
> users
> > do
> > > > not
> > > > > > > need
> > > > > > > > to
> > > > > > > > > > specify one.
> > > > > > > > > > 7.4. In "Aborting Transactions", should we also specify
> the
> > > > > > > "--broker"
> > > > > > > > > node
> > > > > > > > > > as a required option? Otherwise we would not know which
> > > broker
> > > > to
> > > > > > > send
> > > > > > > > > to.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Overall, nice written one, thanks Jason.
> > > > > > > > > >
> > > > > > > > > > Guozhang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > > > > > lucas@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > >> Would it be worth returning
> > > > transactional.id.expiration.ms
> > > > > in
> > > > > > > the
> > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > >
> > > > > > > > > > > > That's an interesting thought as well. Are you trying
> > to
> > > > > avoid
> > > > > > > the
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > specify it through the command line? The tool could
> also
> > > > query
> > > > > > the
> > > > > > > > > value
> > > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > > >
> > > > > > > > > > > Basically. I'm not sure how useful this will be in
> > > practice,
> > > > > > though
> > > > > > > > it
> > > > > > > > > > > might help when debugging.
> > > > > > > > > > >
> > > > > > > > > > > Lucas
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > > > > > jason@confluent.io
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Lucas,
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks for the comments. Responses below:
> > > > > > > > > > > >
> > > > > > > > > > > > > Given that it's possible for replica producer
> states
> > to
> > > > > > diverge
> > > > > > > > > from
> > > > > > > > > > > each
> > > > > > > > > > > > other, it would be very useful if
> > > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > > and
> > > > > > > > > > > > tooling is able to query all partition replicas for
> > their
> > > > > > > producers
> > > > > > > > > > > >
> > > > > > > > > > > > Yes, it makes sense to me to let DescribeProducers
> work
> > > on
> > > > > both
> > > > > > > > > > followers
> > > > > > > > > > > > and leaders. In fact, I'm encouraged that there are
> use
> > > > cases
> > > > > > for
> > > > > > > > > this
> > > > > > > > > > > work
> > > > > > > > > > > > other than detecting hanging transactions. That was
> > > indeed
> > > > > the
> > > > > > > > hope,
> > > > > > > > > > but
> > > > > > > > > > > I
> > > > > > > > > > > > didn't have anything specific in mind. I will update
> > the
> > > > > > > proposal.
> > > > > > > > > > > >
> > > > > > > > > > > > > Would it be worth returning
> > > > transactional.id.expiration.ms
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > > >
> > > > > > > > > > > > That's an interesting thought as well. Are you trying
> > to
> > > > > avoid
> > > > > > > the
> > > > > > > > > need
> > > > > > > > > > > to
> > > > > > > > > > > > specify it through the command line? The tool could
> > also
> > > > > query
> > > > > > > the
> > > > > > > > > > value
> > > > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jason
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > > > > > lucas@confluent.io>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jason,
> > > > > > > > > > > > >
> > > > > > > > > > > > > This looks like a very useful tool, thanks for
> > writing
> > > it
> > > > > up.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Given that it's possible for replica producer
> states
> > to
> > > > > > diverge
> > > > > > > > > from
> > > > > > > > > > > each
> > > > > > > > > > > > > other, it would be very useful if
> > > > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > > > and
> > > > > > > > > > > > > tooling is able to query all partition replicas for
> > > their
> > > > > > > > > producers.
> > > > > > > > > > > One
> > > > > > > > > > > > > way I can see this being used immediately is in
> > kafka's
> > > > > > system
> > > > > > > > > tests,
> > > > > > > > > > > > > especially the ones that inject failures. At the
> end
> > of
> > > > the
> > > > > > > test
> > > > > > > > we
> > > > > > > > > > can
> > > > > > > > > > > > > query all replicas and make sure that their states
> > have
> > > > not
> > > > > > > > > > diverged. I
> > > > > > > > > > > > can
> > > > > > > > > > > > > also see it being useful when debugging production
> > > > clusters
> > > > > > > too.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Would it be worth returning
> > > > transactional.id.expiration.ms
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Lucas
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > > > > > rndgstn@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks,
> > > Jason!
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hey Ron,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We do not typically backport new APIs to older
> > > > > versions.
> > > > > > I
> > > > > > > > > think
> > > > > > > > > > we
> > > > > > > > > > > > can
> > > > > > > > > > > > > > > however make the --abort command compatible
> with
> > > > older
> > > > > > > > > versions.
> > > > > > > > > > It
> > > > > > > > > > > > > would
> > > > > > > > > > > > > > > require a user to do some analysis on their own
> > to
> > > > > > > identify a
> > > > > > > > > > > hanging
> > > > > > > > > > > > > > > transaction, but then they can use the tool
> from
> > a
> > > > new
> > > > > > > > release
> > > > > > > > > to
> > > > > > > > > > > > > > recover.
> > > > > > > > > > > > > > > For example, users could detect a hanging
> > > transaction
> > > > > > > through
> > > > > > > > > the
> > > > > > > > > > > > > > existing
> > > > > > > > > > > > > > > "LastStableOffsetLag" metric and then collect
> the
> > > > > needed
> > > > > > > > > > > information
> > > > > > > > > > > > > > from a
> > > > > > > > > > > > > > > dump of the log (or producer snapshot). It's
> more
> > > > work,
> > > > > > but
> > > > > > > > at
> > > > > > > > > > > least
> > > > > > > > > > > > > it's
> > > > > > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino
> <
> > > > > > > > > > rndgstn@gmail.com>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written
> > > KIP.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Will the implementation be backported to
> prior
> > > > Kafka
> > > > > > > > > versions?
> > > > > > > > > > > The
> > > > > > > > > > > > > > > reason
> > > > > > > > > > > > > > > > I ask is because if it is not backported and
> > > > similar
> > > > > > > > > > > functionality
> > > > > > > > > > > > is
> > > > > > > > > > > > > > not
> > > > > > > > > > > > > > > > otherwise made available for older versions,
> > then
> > > > the
> > > > > > > only
> > > > > > > > > > > recourse
> > > > > > > > > > > > > > > (aside
> > > > > > > > > > > > > > > > from deleting and recreating the topic as you
> > > > pointed
> > > > > > > out)
> > > > > > > > > may
> > > > > > > > > > be
> > > > > > > > > > > > to
> > > > > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up
> > > getting
> > > > > > this
> > > > > > > > > > > > > > functionality).
> > > > > > > > > > > > > > > > Such an upgrade may not be desirable,
> > especially
> > > if
> > > > > the
> > > > > > > > > number
> > > > > > > > > > of
> > > > > > > > > > > > > > > > intermediate versions is considerable. I
> > > understand
> > > > > the
> > > > > > > > > mantra
> > > > > > > > > > of
> > > > > > > > > > > > > > "never
> > > > > > > > > > > > > > > > fall too many versions behind" but the
> reality
> > of
> > > > it
> > > > > is
> > > > > > > > that
> > > > > > > > > it
> > > > > > > > > > > > isn't
> > > > > > > > > > > > > > > > always the case.  Even if the version is
> > > relatively
> > > > > > > recent,
> > > > > > > > > an
> > > > > > > > > > > > > upgrade
> > > > > > > > > > > > > > > may
> > > > > > > > > > > > > > > > still not be possible for some time, and a
> > > quicker
> > > > > > > > resolution
> > > > > > > > > > may
> > > > > > > > > > > > be
> > > > > > > > > > > > > > > > necessary.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason
> > Gustafson <
> > > > > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > I've added a proposal to handle the problem
> > of
> > > > > > hanging
> > > > > > > > > > > > > transactions:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > > > > > .
> > > > > > > > > > > > > > > > > In theory, this should never happen. In
> > > practice,
> > > > > we
> > > > > > > have
> > > > > > > > > hit
> > > > > > > > > > > one
> > > > > > > > > > > > > bug
> > > > > > > > > > > > > > > > where
> > > > > > > > > > > > > > > > > it was possible and there are few good
> > options
> > > > > today
> > > > > > to
> > > > > > > > > > > recover.
> > > > > > > > > > > > > > Take a
> > > > > > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > -- Guozhang
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Boyang Chen <re...@gmail.com>.
Thanks for the updates Jason. I'm pretty satisfied with the overall
motivation and proposed solution, just a couple of more comments.

1. Why do we need to use type string for `StatesFilter` instead of a short
value, as we could translate it and save space?

2. I'm wondering whether the requirement for Describe permission on
TransactionalId works when we are heading towards
https://issues.apache.org/jira/browse/KAFKA-9454, where we could rely on
consumer group id instead of defining the transactional id. At a first
look, I think it should be ok but just want to raise this point.

3. Could the --find-hanging work with checking all brokers in the cluster,
or multiple brokers as a list?

4. Similar to transaction abortion, I guess there is a trade-off for
too-specific vs too-general for the required number of arguments. However,
supposedly I would like to wipe out all the associated transactions with
the given transactional id, or I want to clean up *all *hanging
transactions in the cluster, do I need to write the script on my own? Maybe
we could discuss a bit on whether we would like to support a more holistic
API, or this is good for now.


On Thu, Sep 10, 2020 at 7:53 AM Tom Bentley <tb...@redhat.com> wrote:

> Sounds good to me, thanks!
>
> On Wed, Sep 9, 2020 at 5:30 PM Jason Gustafson <ja...@confluent.io> wrote:
>
> > Hey Tom,
> >
> > Yeah, that's fair. I will update the proposal. I was also thinking of
> > adding a separate column for duration, just to save users the trouble of
> > computing it.
> >
> > Thanks,
> > Jason
> >
> > On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tb...@redhat.com> wrote:
> >
> > > Hi Jason,
> > >
> > > The KIP looks good to me, but I had one question. AFAIU the
> LastTimestamp
> > > column in the output of --describe-producers and --find-hanging is
> there
> > so
> > > the users of the tool know the txnLastUpdateTimestamp of the
> > > TransactionMetadata and from that and the (max) timeout can infer
> > something
> > > about the likelihood that this really is a stuck transaction. If that's
> > the
> > > case then what is the benefit in displaying it as a ms offset from the
> > unix
> > > epoch, rather than an actual date time?
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> > > On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Thanks Jason, I do not have more comments on the KIP then.
> > > >
> > > > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > DescribeProducers
> > > > > response either?
> > > > >
> > > > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the
> > schema.
> > > > > Fixed now!
> > > > >
> > > > > -Jason
> > > > >
> > > > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <
> > jason@confluent.io
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Guozhang,
> > > > > > >
> > > > > > > Thanks for the detailed comments. Responses inline:
> > > > > > >
> > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> old
> > > > > brokers,
> > > > > > > since without the additional field "Partitions" the tool needs
> to
> > > set
> > > > > the
> > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> > still
> > > > > doable
> > > > > > > but would require different call paths, and it's not clear
> > whether
> > > > > that's
> > > > > > > worth doing for old versions.
> > > > > > >
> > > > > > > That's a good question. What I had in mind was to write the
> > marker
> > > > > using
> > > > > > > the last coordinator epoch that was used by the respective
> > > > ProducerId.
> > > > > I
> > > > > > > realized that I left the coordinator epoch out of the
> > > > > `DescribeProducers`
> > > > > > > response, so I have updated the KIP to include it. It is
> possible
> > > > that
> > > > > > > there is no coordinator epoch associated with a given
> ProducerId
> > > > (e.g.
> > > > > if
> > > > > > > it is the first transaction from that producer), but in this
> case
> > > we
> > > > > can
> > > > > > > use 0.
> > > > > > >
> > > > > > > As for whether this is worth doing, I guess I would be more
> > > inclined
> > > > to
> > > > > > > leave it out if users had a reasonable alternative today to
> > address
> > > > > this
> > > > > > > problem.
> > > > > > >
> > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> to
> > > only
> > > > > > > leaders
> > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > really
> > > > > > > "ListTransactions to be sent to coordinators only"? From the
> > > workflow
> > > > > > > you've described, based on the results back from
> > DescribeProducers,
> > > > we
> > > > > > > should just immediately send ListTransactions to the
> > > > > > > corresponding coordinators based on the collected producer ids,
> > > > instead
> > > > > > of
> > > > > > > trying to send to any brokers right?
> > > > > > >
> > > > > > > I'm going to change `DescribeProducers` so that it can be
> handled
> > > by
> > > > > any
> > > > > > > replica of a topic partition. This was suggested by Lucas in
> > order
> > > to
> > > > > > allow
> > > > > > > this API to be used for replica consistency testing. As far as
> > > > > > > `ListTransactions`, I was treating this similarly to
> > `ListGroups`.
> > > > > > Although
> > > > > > > we know that the coordinators are the leaders of the
> > > > > __transaction_state
> > > > > > > partitions, this is more of an implementation detail. From an
> API
> > > > > > > perspective, we say that any broker could be a transaction
> > > > coordinator.
> > > > > > >
> > > > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > > > permission
> > > > > > > on
> > > > > > > the associated topic sufficient to allow any users to get all
> > > > producer
> > > > > > > information writing to the specific topic-partitions including
> > last
> > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > sensitive?
> > > > > > > Should we require "ClusterAction" to only allow operators only?
> > > > > > >
> > > > > > > That's a fair point. Do you think `Read` permission would be
> > > > > reasonable?
> > > > > > > This is all information that could be obtained by reading the
> > > topic.
> > > > > > >
> > > > > > > Yeah that makes sense.
> > > > > >
> > > > > >
> > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > included
> > > in
> > > > > the
> > > > > > > DescribeTransaction response schema? Otherwise the user would
> not
> > > get
> > > > > it
> > > > > > in
> > > > > > > the following WriteTxnMarker request.
> > > > > > >
> > > > > > > The `DescribeTransaction` API is sent to the transaction
> > > coordinator,
> > > > > > which
> > > > > > > does not know the start offset of a transaction in each topic
> > > > > partition.
> > > > > > > That is why we need `DescribeProducers`.
> > > > > > >
> > > > > >
> > > > > > Hmm, but the "TxnStartOffset" is not included in the
> > > DescribeProducers
> > > > > > response either?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > > 5. It is a bit easier for readers to highlight the added
> fields
> > > in
> > > > > the
> > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> > > adding
> > > > > > > "Partitions" with the starting offset, right?). Also as for its
> > > > > response
> > > > > > it
> > > > > > > seems we do not make any schema changes except adding one more
> > > > > potential
> > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case
> > we
> > > > can
> > > > > > just
> > > > > > > state that explicitly.
> > > > > > >
> > > > > > > I highlighted the new field in the request. For the response,
> the
> > > KIP
> > > > > > > states the following: "There are no changes to the response
> > schema,
> > > > but
> > > > > > it
> > > > > > > will be bumped. Note that we are also enabling flexible version
> > > > > support."
> > > > > > >
> > > > > > > > 6. It is not clear to me for the overloaded function that the
> > > > > following
> > > > > > > option classes are not specified, what should be the default
> > > options?
> > > > > > > ...
> > > > > > >
> > > > > > > I was just trying to stick with existing conventions, but I
> will
> > > add
> > > > > some
> > > > > > > more detail here. I think we should probably still include
> > > > > > > `AbortTransactionOptions`. The `Options` classes are how users
> > > > override
> > > > > > > timeouts.
> > > > > > >
> > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > presume
> > > we
> > > > > > would
> > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > >
> > > > > > > I think it should be required as a reasonable way to limit the
> > > scope
> > > > of
> > > > > > the
> > > > > > > search. This is meant to be guided by metrics after all. If we
> do
> > > not
> > > > > > limit
> > > > > > > the scope to a single broker, then the behavior might get worse
> > as
> > > > the
> > > > > > > cluster grows. I will clarify this.
> > > > > > >
> > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > feature
> > > > in
> > > > > > the
> > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > intentional?
> > > > > > > Personally I feel exposing a "--list-producers" may be useful
> > too:
> > > if
> > > > > we
> > > > > > > believe the user has the right ACL, it is legitimate to return
> > the
> > > > > > producer
> > > > > > > information to her anyways. But that is debatable in the meta
> > point
> > > > 3)
> > > > > > > above.
> > > > > > >
> > > > > > > Yeah, I was planning to add this to support the use case that
> > Lucas
> > > > > > > mentioned. There is some awkwardness since it is a little
> > difficult
> > > > to
> > > > > > > convey different sources of information through the same
> > command. I
> > > > > guess
> > > > > > > we can do `--list producers` and `--list transactions` and
> > explain
> > > in
> > > > > the
> > > > > > > documentation. Maybe that is good enough.
> > > > > > >
> > > > > > > > 7.3 "Describing Transactions": we should also explain how
> that
> > > > would
> > > > > be
> > > > > > > executed, e.g. at least we should clarify that we would first
> > find
> > > > the
> > > > > > > coordinator based on the transactional.id and hence users do
> not
> > > > need
> > > > > to
> > > > > > > specify one.
> > > > > > >
> > > > > > > Sure, makes sense.
> > > > > > >
> > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > "--broker"
> > > > > > > node
> > > > > > > as a required option? Otherwise we would not know which broker
> to
> > > > send
> > > > > > to.
> > > > > > >
> > > > > > > The --topic and --partition arguments are required, so the
> target
> > > is
> > > > > > always
> > > > > > > the leader of that partition.
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > > > > bob.barrett@confluent.io
> > > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > Thanks for this KIP, I think this will be a huge operational
> > > > > > improvement
> > > > > > > > and overall it looks great to me.
> > > > > > > >
> > > > > > > > I'm not sure how much value the MaxActiveTransactionDuration
> > > metric
> > > > > > adds,
> > > > > > > > given that we have the --find-hanging option in the tool. As
> > you
> > > > > > mention,
> > > > > > > > instances of these transactions are expected to be rare, and
> a
> > > > > > > > partition-level metric, which can generate a lot of data,
> seems
> > > > very
> > > > > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > > > > kafka-transactions
> > > > > > > > --find-hanging on the relevant broker" is a reasonable
> process
> > > for
> > > > > > > cluster
> > > > > > > > operators to follow.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Bob
> > > > > > > >
> > > > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <
> > > wangguoz@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Thanks for the written KIP. I think this is going to be a
> > very
> > > > > useful
> > > > > > > > tool
> > > > > > > > > for operational improvements since with eos in its current
> > > stage,
> > > > > we
> > > > > > > > cannot
> > > > > > > > > confidently assert that we are bug-free, and even in the
> > future
> > > > > when
> > > > > > we
> > > > > > > > are
> > > > > > > > > confident this is still going to be leveraged by older
> > > versioned
> > > > > > > brokers.
> > > > > > > > > Regarding the solution, I've also debated myself whether
> > Kafka
> > > > > should
> > > > > > > > > "self-heal" automatically when detected in such situations,
> > or
> > > > > should
> > > > > > > we
> > > > > > > > > instead build into ecosystem tooling to let operators do
> it.
> > > And
> > > > > I've
> > > > > > > > also
> > > > > > > > > convinced myself that the latter should be a better
> solution
> > to
> > > > > keep
> > > > > > > > Kafka
> > > > > > > > > software itself simpler.
> > > > > > > > >
> > > > > > > > > Regarding the KIP itself, I have a few meta comments below:
> > > > > > > > >
> > > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> > old
> > > > > > brokers,
> > > > > > > > > since without the additional field "Partitions" the tool
> > needs
> > > to
> > > > > set
> > > > > > > the
> > > > > > > > > coordinator epoch correctly instead of "-1"? Arguably
> that's
> > > > still
> > > > > > > doable
> > > > > > > > > but would require different call paths, and it's not clear
> > > > whether
> > > > > > > that's
> > > > > > > > > worth doing for old versions.
> > > > > > > > >
> > > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> > to
> > > > only
> > > > > > > > leaders
> > > > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > > > really
> > > > > > > > > "ListTransactions to be sent to coordinators only"? From
> the
> > > > > workflow
> > > > > > > > > you've described, based on the results back from
> > > > DescribeProducers,
> > > > > > we
> > > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > > corresponding coordinators based on the collected producer
> > ids,
> > > > > > instead
> > > > > > > > of
> > > > > > > > > trying to send to any brokers right?
> > > > > > > > >
> > > > > > > > > Also I'm a bit concerned if "ListTransactions" could
> > > potentially
> > > > > > return
> > > > > > > > too
> > > > > > > > > much data with "StateFilters" set to all states, including
> > > > > completed
> > > > > > > > ones.
> > > > > > > > > Do we expect users ever want to know transactions that are
> > not
> > > > > > pending?
> > > > > > > > On
> > > > > > > > > the other hand, maybe we can just require users to specify
> > the
> > > > > > "pids[]"
> > > > > > > > in
> > > > > > > > > this request too to further filter those un-interested
> > > > > transactions.
> > > > > > > This
> > > > > > > > > also works well with the workflow: we know exactly from
> > > > > > > > "DescribeProducers"
> > > > > > > > > which pids are we diagnosing right now, so in the follow-up
> > > > > > > > > "ListTransactions" we should also only care for those
> > > partitions
> > > > > > only.
> > > > > > > > >
> > > > > > > > > 3. One thing I'm a bit hesitant about is that, is
> `Describe`
> > > > > > permission
> > > > > > > > on
> > > > > > > > > the associated topic sufficient to allow any users to get
> all
> > > > > > producer
> > > > > > > > > information writing to the specific topic-partitions
> > including
> > > > last
> > > > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > > > sensitive?
> > > > > > > > > Should we require "ClusterAction" to only allow operators
> > only?
> > > > > > > > >
> > > > > > > > > Below are more detailed comments:
> > > > > > > > >
> > > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > > included
> > > > in
> > > > > > the
> > > > > > > > > DescribeTransaction response schema? Otherwise the user
> would
> > > not
> > > > > get
> > > > > > > it
> > > > > > > > in
> > > > > > > > > the following WriteTxnMarker request.
> > > > > > > > >
> > > > > > > > > 5. It is a bit easier for readers to highlight the added
> > fields
> > > > in
> > > > > > the
> > > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> > only
> > > > > adding
> > > > > > > > > "Partitions" with the starting offset, right?). Also as for
> > its
> > > > > > > response
> > > > > > > > it
> > > > > > > > > seems we do not make any schema changes except adding one
> > more
> > > > > > > potential
> > > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the
> > case
> > > > we
> > > > > > can
> > > > > > > > just
> > > > > > > > > state that explicitly.
> > > > > > > > >
> > > > > > > > > 6. It is not clear to me for the overloaded function that
> the
> > > > > > following
> > > > > > > > > option classes are not specified, what should be the
> default
> > > > > options?
> > > > > > > > >
> > > > > > > > > * ListTransactionsOptions
> > > > > > > > > * DescribeTransactionsOptions
> > > > > > > > > * DescribeProducersOptions
> > > > > > > > >
> > > > > > > > > Also, it seems AbortTransactionOptions would just be empty?
> > If
> > > > yes
> > > > > do
> > > > > > > we
> > > > > > > > > really need this option class for now?
> > > > > > > > >
> > > > > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > > presume
> > > > we
> > > > > > > would
> > > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > > feature
> > > > > in
> > > > > > > the
> > > > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > > > intentional?
> > > > > > > > > Personally I feel exposing a "--list-producers" may be
> useful
> > > > too:
> > > > > if
> > > > > > > we
> > > > > > > > > believe the user has the right ACL, it is legitimate to
> > return
> > > > the
> > > > > > > > producer
> > > > > > > > > information to her anyways. But that is debatable in the
> meta
> > > > point
> > > > > > 3)
> > > > > > > > > above.
> > > > > > > > > 7.3 "Describing Transactions": we should also explain how
> > that
> > > > > would
> > > > > > be
> > > > > > > > > executed, e.g. at least we should clarify that we would
> first
> > > > find
> > > > > > the
> > > > > > > > > coordinator based on the transactional.id and hence users
> do
> > > not
> > > > > > need
> > > > > > > to
> > > > > > > > > specify one.
> > > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > > "--broker"
> > > > > > > > node
> > > > > > > > > as a required option? Otherwise we would not know which
> > broker
> > > to
> > > > > > send
> > > > > > > > to.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Overall, nice written one, thanks Jason.
> > > > > > > > >
> > > > > > > > > Guozhang
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > > > > lucas@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > >> Would it be worth returning
> > > transactional.id.expiration.ms
> > > > in
> > > > > > the
> > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > >
> > > > > > > > > > > That's an interesting thought as well. Are you trying
> to
> > > > avoid
> > > > > > the
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > specify it through the command line? The tool could also
> > > query
> > > > > the
> > > > > > > > value
> > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > >
> > > > > > > > > > Basically. I'm not sure how useful this will be in
> > practice,
> > > > > though
> > > > > > > it
> > > > > > > > > > might help when debugging.
> > > > > > > > > >
> > > > > > > > > > Lucas
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > > > > jason@confluent.io
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Lucas,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the comments. Responses below:
> > > > > > > > > > >
> > > > > > > > > > > > Given that it's possible for replica producer states
> to
> > > > > diverge
> > > > > > > > from
> > > > > > > > > > each
> > > > > > > > > > > other, it would be very useful if
> > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > and
> > > > > > > > > > > tooling is able to query all partition replicas for
> their
> > > > > > producers
> > > > > > > > > > >
> > > > > > > > > > > Yes, it makes sense to me to let DescribeProducers work
> > on
> > > > both
> > > > > > > > > followers
> > > > > > > > > > > and leaders. In fact, I'm encouraged that there are use
> > > cases
> > > > > for
> > > > > > > > this
> > > > > > > > > > work
> > > > > > > > > > > other than detecting hanging transactions. That was
> > indeed
> > > > the
> > > > > > > hope,
> > > > > > > > > but
> > > > > > > > > > I
> > > > > > > > > > > didn't have anything specific in mind. I will update
> the
> > > > > > proposal.
> > > > > > > > > > >
> > > > > > > > > > > > Would it be worth returning
> > > transactional.id.expiration.ms
> > > > > in
> > > > > > > the
> > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > >
> > > > > > > > > > > That's an interesting thought as well. Are you trying
> to
> > > > avoid
> > > > > > the
> > > > > > > > need
> > > > > > > > > > to
> > > > > > > > > > > specify it through the command line? The tool could
> also
> > > > query
> > > > > > the
> > > > > > > > > value
> > > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > > > > lucas@confluent.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jason,
> > > > > > > > > > > >
> > > > > > > > > > > > This looks like a very useful tool, thanks for
> writing
> > it
> > > > up.
> > > > > > > > > > > >
> > > > > > > > > > > > Given that it's possible for replica producer states
> to
> > > > > diverge
> > > > > > > > from
> > > > > > > > > > each
> > > > > > > > > > > > other, it would be very useful if
> > > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > > and
> > > > > > > > > > > > tooling is able to query all partition replicas for
> > their
> > > > > > > > producers.
> > > > > > > > > > One
> > > > > > > > > > > > way I can see this being used immediately is in
> kafka's
> > > > > system
> > > > > > > > tests,
> > > > > > > > > > > > especially the ones that inject failures. At the end
> of
> > > the
> > > > > > test
> > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > > query all replicas and make sure that their states
> have
> > > not
> > > > > > > > > diverged. I
> > > > > > > > > > > can
> > > > > > > > > > > > also see it being useful when debugging production
> > > clusters
> > > > > > too.
> > > > > > > > > > > >
> > > > > > > > > > > > Would it be worth returning
> > > transactional.id.expiration.ms
> > > > > in
> > > > > > > the
> > > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > > >
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > >
> > > > > > > > > > > > Lucas
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > > > > rndgstn@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks,
> > Jason!
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ron
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hey Ron,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > We do not typically backport new APIs to older
> > > > versions.
> > > > > I
> > > > > > > > think
> > > > > > > > > we
> > > > > > > > > > > can
> > > > > > > > > > > > > > however make the --abort command compatible with
> > > older
> > > > > > > > versions.
> > > > > > > > > It
> > > > > > > > > > > > would
> > > > > > > > > > > > > > require a user to do some analysis on their own
> to
> > > > > > identify a
> > > > > > > > > > hanging
> > > > > > > > > > > > > > transaction, but then they can use the tool from
> a
> > > new
> > > > > > > release
> > > > > > > > to
> > > > > > > > > > > > > recover.
> > > > > > > > > > > > > > For example, users could detect a hanging
> > transaction
> > > > > > through
> > > > > > > > the
> > > > > > > > > > > > > existing
> > > > > > > > > > > > > > "LastStableOffsetLag" metric and then collect the
> > > > needed
> > > > > > > > > > information
> > > > > > > > > > > > > from a
> > > > > > > > > > > > > > dump of the log (or producer snapshot). It's more
> > > work,
> > > > > but
> > > > > > > at
> > > > > > > > > > least
> > > > > > > > > > > > it's
> > > > > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > > > > > > rndgstn@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written
> > KIP.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Will the implementation be backported to prior
> > > Kafka
> > > > > > > > versions?
> > > > > > > > > > The
> > > > > > > > > > > > > > reason
> > > > > > > > > > > > > > > I ask is because if it is not backported and
> > > similar
> > > > > > > > > > functionality
> > > > > > > > > > > is
> > > > > > > > > > > > > not
> > > > > > > > > > > > > > > otherwise made available for older versions,
> then
> > > the
> > > > > > only
> > > > > > > > > > recourse
> > > > > > > > > > > > > > (aside
> > > > > > > > > > > > > > > from deleting and recreating the topic as you
> > > pointed
> > > > > > out)
> > > > > > > > may
> > > > > > > > > be
> > > > > > > > > > > to
> > > > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up
> > getting
> > > > > this
> > > > > > > > > > > > > functionality).
> > > > > > > > > > > > > > > Such an upgrade may not be desirable,
> especially
> > if
> > > > the
> > > > > > > > number
> > > > > > > > > of
> > > > > > > > > > > > > > > intermediate versions is considerable. I
> > understand
> > > > the
> > > > > > > > mantra
> > > > > > > > > of
> > > > > > > > > > > > > "never
> > > > > > > > > > > > > > > fall too many versions behind" but the reality
> of
> > > it
> > > > is
> > > > > > > that
> > > > > > > > it
> > > > > > > > > > > isn't
> > > > > > > > > > > > > > > always the case.  Even if the version is
> > relatively
> > > > > > recent,
> > > > > > > > an
> > > > > > > > > > > > upgrade
> > > > > > > > > > > > > > may
> > > > > > > > > > > > > > > still not be possible for some time, and a
> > quicker
> > > > > > > resolution
> > > > > > > > > may
> > > > > > > > > > > be
> > > > > > > > > > > > > > > necessary.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason
> Gustafson <
> > > > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I've added a proposal to handle the problem
> of
> > > > > hanging
> > > > > > > > > > > > transactions:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > > > > .
> > > > > > > > > > > > > > > > In theory, this should never happen. In
> > practice,
> > > > we
> > > > > > have
> > > > > > > > hit
> > > > > > > > > > one
> > > > > > > > > > > > bug
> > > > > > > > > > > > > > > where
> > > > > > > > > > > > > > > > it was possible and there are few good
> options
> > > > today
> > > > > to
> > > > > > > > > > recover.
> > > > > > > > > > > > > Take a
> > > > > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > -- Guozhang
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Tom Bentley <tb...@redhat.com>.
Sounds good to me, thanks!

On Wed, Sep 9, 2020 at 5:30 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Tom,
>
> Yeah, that's fair. I will update the proposal. I was also thinking of
> adding a separate column for duration, just to save users the trouble of
> computing it.
>
> Thanks,
> Jason
>
> On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tb...@redhat.com> wrote:
>
> > Hi Jason,
> >
> > The KIP looks good to me, but I had one question. AFAIU the LastTimestamp
> > column in the output of --describe-producers and --find-hanging is there
> so
> > the users of the tool know the txnLastUpdateTimestamp of the
> > TransactionMetadata and from that and the (max) timeout can infer
> something
> > about the likelihood that this really is a stuck transaction. If that's
> the
> > case then what is the benefit in displaying it as a ms offset from the
> unix
> > epoch, rather than an actual date time?
> >
> > Thanks,
> >
> > Tom
> >
> > On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Thanks Jason, I do not have more comments on the KIP then.
> > >
> > > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > > Hmm, but the "TxnStartOffset" is not included in the
> > DescribeProducers
> > > > response either?
> > > >
> > > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the
> schema.
> > > > Fixed now!
> > > >
> > > > -Jason
> > > >
> > > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <
> jason@confluent.io
> > >
> > > > > wrote:
> > > > >
> > > > > > Hey Guozhang,
> > > > > >
> > > > > > Thanks for the detailed comments. Responses inline:
> > > > > >
> > > > > > > 1. I'd like to clarify how we can make "--abort" work with old
> > > > brokers,
> > > > > > since without the additional field "Partitions" the tool needs to
> > set
> > > > the
> > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> still
> > > > doable
> > > > > > but would require different call paths, and it's not clear
> whether
> > > > that's
> > > > > > worth doing for old versions.
> > > > > >
> > > > > > That's a good question. What I had in mind was to write the
> marker
> > > > using
> > > > > > the last coordinator epoch that was used by the respective
> > > ProducerId.
> > > > I
> > > > > > realized that I left the coordinator epoch out of the
> > > > `DescribeProducers`
> > > > > > response, so I have updated the KIP to include it. It is possible
> > > that
> > > > > > there is no coordinator epoch associated with a given ProducerId
> > > (e.g.
> > > > if
> > > > > > it is the first transaction from that producer), but in this case
> > we
> > > > can
> > > > > > use 0.
> > > > > >
> > > > > > As for whether this is worth doing, I guess I would be more
> > inclined
> > > to
> > > > > > leave it out if users had a reasonable alternative today to
> address
> > > > this
> > > > > > problem.
> > > > > >
> > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent to
> > only
> > > > > > leaders
> > > > > > while ListTransactions can be sent to any brokers? Or is it
> really
> > > > > > "ListTransactions to be sent to coordinators only"? From the
> > workflow
> > > > > > you've described, based on the results back from
> DescribeProducers,
> > > we
> > > > > > should just immediately send ListTransactions to the
> > > > > > corresponding coordinators based on the collected producer ids,
> > > instead
> > > > > of
> > > > > > trying to send to any brokers right?
> > > > > >
> > > > > > I'm going to change `DescribeProducers` so that it can be handled
> > by
> > > > any
> > > > > > replica of a topic partition. This was suggested by Lucas in
> order
> > to
> > > > > allow
> > > > > > this API to be used for replica consistency testing. As far as
> > > > > > `ListTransactions`, I was treating this similarly to
> `ListGroups`.
> > > > > Although
> > > > > > we know that the coordinators are the leaders of the
> > > > __transaction_state
> > > > > > partitions, this is more of an implementation detail. From an API
> > > > > > perspective, we say that any broker could be a transaction
> > > coordinator.
> > > > > >
> > > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > > permission
> > > > > > on
> > > > > > the associated topic sufficient to allow any users to get all
> > > producer
> > > > > > information writing to the specific topic-partitions including
> last
> > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > sensitive?
> > > > > > Should we require "ClusterAction" to only allow operators only?
> > > > > >
> > > > > > That's a fair point. Do you think `Read` permission would be
> > > > reasonable?
> > > > > > This is all information that could be obtained by reading the
> > topic.
> > > > > >
> > > > > > Yeah that makes sense.
> > > > >
> > > > >
> > > > > > > 4. From the example it seems "TxnStartOffset" should be
> included
> > in
> > > > the
> > > > > > DescribeTransaction response schema? Otherwise the user would not
> > get
> > > > it
> > > > > in
> > > > > > the following WriteTxnMarker request.
> > > > > >
> > > > > > The `DescribeTransaction` API is sent to the transaction
> > coordinator,
> > > > > which
> > > > > > does not know the start offset of a transaction in each topic
> > > > partition.
> > > > > > That is why we need `DescribeProducers`.
> > > > > >
> > > > >
> > > > > Hmm, but the "TxnStartOffset" is not included in the
> > DescribeProducers
> > > > > response either?
> > > > >
> > > > >
> > > > > >
> > > > > > > 5. It is a bit easier for readers to highlight the added fields
> > in
> > > > the
> > > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> > adding
> > > > > > "Partitions" with the starting offset, right?). Also as for its
> > > > response
> > > > > it
> > > > > > seems we do not make any schema changes except adding one more
> > > > potential
> > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case
> we
> > > can
> > > > > just
> > > > > > state that explicitly.
> > > > > >
> > > > > > I highlighted the new field in the request. For the response, the
> > KIP
> > > > > > states the following: "There are no changes to the response
> schema,
> > > but
> > > > > it
> > > > > > will be bumped. Note that we are also enabling flexible version
> > > > support."
> > > > > >
> > > > > > > 6. It is not clear to me for the overloaded function that the
> > > > following
> > > > > > option classes are not specified, what should be the default
> > options?
> > > > > > ...
> > > > > >
> > > > > > I was just trying to stick with existing conventions, but I will
> > add
> > > > some
> > > > > > more detail here. I think we should probably still include
> > > > > > `AbortTransactionOptions`. The `Options` classes are how users
> > > override
> > > > > > timeouts.
> > > > > >
> > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> presume
> > we
> > > > > would
> > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > >
> > > > > > I think it should be required as a reasonable way to limit the
> > scope
> > > of
> > > > > the
> > > > > > search. This is meant to be guided by metrics after all. If we do
> > not
> > > > > limit
> > > > > > the scope to a single broker, then the behavior might get worse
> as
> > > the
> > > > > > cluster grows. I will clarify this.
> > > > > >
> > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> feature
> > > in
> > > > > the
> > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > intentional?
> > > > > > Personally I feel exposing a "--list-producers" may be useful
> too:
> > if
> > > > we
> > > > > > believe the user has the right ACL, it is legitimate to return
> the
> > > > > producer
> > > > > > information to her anyways. But that is debatable in the meta
> point
> > > 3)
> > > > > > above.
> > > > > >
> > > > > > Yeah, I was planning to add this to support the use case that
> Lucas
> > > > > > mentioned. There is some awkwardness since it is a little
> difficult
> > > to
> > > > > > convey different sources of information through the same
> command. I
> > > > guess
> > > > > > we can do `--list producers` and `--list transactions` and
> explain
> > in
> > > > the
> > > > > > documentation. Maybe that is good enough.
> > > > > >
> > > > > > > 7.3 "Describing Transactions": we should also explain how that
> > > would
> > > > be
> > > > > > executed, e.g. at least we should clarify that we would first
> find
> > > the
> > > > > > coordinator based on the transactional.id and hence users do not
> > > need
> > > > to
> > > > > > specify one.
> > > > > >
> > > > > > Sure, makes sense.
> > > > > >
> > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > "--broker"
> > > > > > node
> > > > > > as a required option? Otherwise we would not know which broker to
> > > send
> > > > > to.
> > > > > >
> > > > > > The --topic and --partition arguments are required, so the target
> > is
> > > > > always
> > > > > > the leader of that partition.
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > > > bob.barrett@confluent.io
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > Thanks for this KIP, I think this will be a huge operational
> > > > > improvement
> > > > > > > and overall it looks great to me.
> > > > > > >
> > > > > > > I'm not sure how much value the MaxActiveTransactionDuration
> > metric
> > > > > adds,
> > > > > > > given that we have the --find-hanging option in the tool. As
> you
> > > > > mention,
> > > > > > > instances of these transactions are expected to be rare, and a
> > > > > > > partition-level metric, which can generate a lot of data, seems
> > > very
> > > > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > > > kafka-transactions
> > > > > > > --find-hanging on the relevant broker" is a reasonable process
> > for
> > > > > > cluster
> > > > > > > operators to follow.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Bob
> > > > > > >
> > > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <
> > wangguoz@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > Thanks for the written KIP. I think this is going to be a
> very
> > > > useful
> > > > > > > tool
> > > > > > > > for operational improvements since with eos in its current
> > stage,
> > > > we
> > > > > > > cannot
> > > > > > > > confidently assert that we are bug-free, and even in the
> future
> > > > when
> > > > > we
> > > > > > > are
> > > > > > > > confident this is still going to be leveraged by older
> > versioned
> > > > > > brokers.
> > > > > > > > Regarding the solution, I've also debated myself whether
> Kafka
> > > > should
> > > > > > > > "self-heal" automatically when detected in such situations,
> or
> > > > should
> > > > > > we
> > > > > > > > instead build into ecosystem tooling to let operators do it.
> > And
> > > > I've
> > > > > > > also
> > > > > > > > convinced myself that the latter should be a better solution
> to
> > > > keep
> > > > > > > Kafka
> > > > > > > > software itself simpler.
> > > > > > > >
> > > > > > > > Regarding the KIP itself, I have a few meta comments below:
> > > > > > > >
> > > > > > > > 1. I'd like to clarify how we can make "--abort" work with
> old
> > > > > brokers,
> > > > > > > > since without the additional field "Partitions" the tool
> needs
> > to
> > > > set
> > > > > > the
> > > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> > > still
> > > > > > doable
> > > > > > > > but would require different call paths, and it's not clear
> > > whether
> > > > > > that's
> > > > > > > > worth doing for old versions.
> > > > > > > >
> > > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent
> to
> > > only
> > > > > > > leaders
> > > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > > really
> > > > > > > > "ListTransactions to be sent to coordinators only"? From the
> > > > workflow
> > > > > > > > you've described, based on the results back from
> > > DescribeProducers,
> > > > > we
> > > > > > > > should just immediately send ListTransactions to the
> > > > > > > > corresponding coordinators based on the collected producer
> ids,
> > > > > instead
> > > > > > > of
> > > > > > > > trying to send to any brokers right?
> > > > > > > >
> > > > > > > > Also I'm a bit concerned if "ListTransactions" could
> > potentially
> > > > > return
> > > > > > > too
> > > > > > > > much data with "StateFilters" set to all states, including
> > > > completed
> > > > > > > ones.
> > > > > > > > Do we expect users ever want to know transactions that are
> not
> > > > > pending?
> > > > > > > On
> > > > > > > > the other hand, maybe we can just require users to specify
> the
> > > > > "pids[]"
> > > > > > > in
> > > > > > > > this request too to further filter those un-interested
> > > > transactions.
> > > > > > This
> > > > > > > > also works well with the workflow: we know exactly from
> > > > > > > "DescribeProducers"
> > > > > > > > which pids are we diagnosing right now, so in the follow-up
> > > > > > > > "ListTransactions" we should also only care for those
> > partitions
> > > > > only.
> > > > > > > >
> > > > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > > > permission
> > > > > > > on
> > > > > > > > the associated topic sufficient to allow any users to get all
> > > > > producer
> > > > > > > > information writing to the specific topic-partitions
> including
> > > last
> > > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > > sensitive?
> > > > > > > > Should we require "ClusterAction" to only allow operators
> only?
> > > > > > > >
> > > > > > > > Below are more detailed comments:
> > > > > > > >
> > > > > > > > 4. From the example it seems "TxnStartOffset" should be
> > included
> > > in
> > > > > the
> > > > > > > > DescribeTransaction response schema? Otherwise the user would
> > not
> > > > get
> > > > > > it
> > > > > > > in
> > > > > > > > the following WriteTxnMarker request.
> > > > > > > >
> > > > > > > > 5. It is a bit easier for readers to highlight the added
> fields
> > > in
> > > > > the
> > > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are
> only
> > > > adding
> > > > > > > > "Partitions" with the starting offset, right?). Also as for
> its
> > > > > > response
> > > > > > > it
> > > > > > > > seems we do not make any schema changes except adding one
> more
> > > > > > potential
> > > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the
> case
> > > we
> > > > > can
> > > > > > > just
> > > > > > > > state that explicitly.
> > > > > > > >
> > > > > > > > 6. It is not clear to me for the overloaded function that the
> > > > > following
> > > > > > > > option classes are not specified, what should be the default
> > > > options?
> > > > > > > >
> > > > > > > > * ListTransactionsOptions
> > > > > > > > * DescribeTransactionsOptions
> > > > > > > > * DescribeProducersOptions
> > > > > > > >
> > > > > > > > Also, it seems AbortTransactionOptions would just be empty?
> If
> > > yes
> > > > do
> > > > > > we
> > > > > > > > really need this option class for now?
> > > > > > > >
> > > > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> > presume
> > > we
> > > > > > would
> > > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> > feature
> > > > in
> > > > > > the
> > > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > > intentional?
> > > > > > > > Personally I feel exposing a "--list-producers" may be useful
> > > too:
> > > > if
> > > > > > we
> > > > > > > > believe the user has the right ACL, it is legitimate to
> return
> > > the
> > > > > > > producer
> > > > > > > > information to her anyways. But that is debatable in the meta
> > > point
> > > > > 3)
> > > > > > > > above.
> > > > > > > > 7.3 "Describing Transactions": we should also explain how
> that
> > > > would
> > > > > be
> > > > > > > > executed, e.g. at least we should clarify that we would first
> > > find
> > > > > the
> > > > > > > > coordinator based on the transactional.id and hence users do
> > not
> > > > > need
> > > > > > to
> > > > > > > > specify one.
> > > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > > "--broker"
> > > > > > > node
> > > > > > > > as a required option? Otherwise we would not know which
> broker
> > to
> > > > > send
> > > > > > > to.
> > > > > > > >
> > > > > > > >
> > > > > > > > Overall, nice written one, thanks Jason.
> > > > > > > >
> > > > > > > > Guozhang
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > > > lucas@confluent.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > >> Would it be worth returning
> > transactional.id.expiration.ms
> > > in
> > > > > the
> > > > > > > > > DescribeProducersResponse?
> > > > > > > > >
> > > > > > > > > > That's an interesting thought as well. Are you trying to
> > > avoid
> > > > > the
> > > > > > > need
> > > > > > > > > to
> > > > > > > > > specify it through the command line? The tool could also
> > query
> > > > the
> > > > > > > value
> > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > >
> > > > > > > > > Basically. I'm not sure how useful this will be in
> practice,
> > > > though
> > > > > > it
> > > > > > > > > might help when debugging.
> > > > > > > > >
> > > > > > > > > Lucas
> > > > > > > > >
> > > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > > > jason@confluent.io
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Lucas,
> > > > > > > > > >
> > > > > > > > > > Thanks for the comments. Responses below:
> > > > > > > > > >
> > > > > > > > > > > Given that it's possible for replica producer states to
> > > > diverge
> > > > > > > from
> > > > > > > > > each
> > > > > > > > > > other, it would be very useful if
> > > > > > DescribeProducers(Request,Response)
> > > > > > > > and
> > > > > > > > > > tooling is able to query all partition replicas for their
> > > > > producers
> > > > > > > > > >
> > > > > > > > > > Yes, it makes sense to me to let DescribeProducers work
> on
> > > both
> > > > > > > > followers
> > > > > > > > > > and leaders. In fact, I'm encouraged that there are use
> > cases
> > > > for
> > > > > > > this
> > > > > > > > > work
> > > > > > > > > > other than detecting hanging transactions. That was
> indeed
> > > the
> > > > > > hope,
> > > > > > > > but
> > > > > > > > > I
> > > > > > > > > > didn't have anything specific in mind. I will update the
> > > > > proposal.
> > > > > > > > > >
> > > > > > > > > > > Would it be worth returning
> > transactional.id.expiration.ms
> > > > in
> > > > > > the
> > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > >
> > > > > > > > > > That's an interesting thought as well. Are you trying to
> > > avoid
> > > > > the
> > > > > > > need
> > > > > > > > > to
> > > > > > > > > > specify it through the command line? The tool could also
> > > query
> > > > > the
> > > > > > > > value
> > > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Jason
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > > > lucas@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Jason,
> > > > > > > > > > >
> > > > > > > > > > > This looks like a very useful tool, thanks for writing
> it
> > > up.
> > > > > > > > > > >
> > > > > > > > > > > Given that it's possible for replica producer states to
> > > > diverge
> > > > > > > from
> > > > > > > > > each
> > > > > > > > > > > other, it would be very useful if
> > > > > > > DescribeProducers(Request,Response)
> > > > > > > > > and
> > > > > > > > > > > tooling is able to query all partition replicas for
> their
> > > > > > > producers.
> > > > > > > > > One
> > > > > > > > > > > way I can see this being used immediately is in kafka's
> > > > system
> > > > > > > tests,
> > > > > > > > > > > especially the ones that inject failures. At the end of
> > the
> > > > > test
> > > > > > we
> > > > > > > > can
> > > > > > > > > > > query all replicas and make sure that their states have
> > not
> > > > > > > > diverged. I
> > > > > > > > > > can
> > > > > > > > > > > also see it being useful when debugging production
> > clusters
> > > > > too.
> > > > > > > > > > >
> > > > > > > > > > > Would it be worth returning
> > transactional.id.expiration.ms
> > > > in
> > > > > > the
> > > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > >
> > > > > > > > > > > Lucas
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > > > rndgstn@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks,
> Jason!
> > > > > > > > > > > >
> > > > > > > > > > > > Ron
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > > > jason@confluent.io>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hey Ron,
> > > > > > > > > > > > >
> > > > > > > > > > > > > We do not typically backport new APIs to older
> > > versions.
> > > > I
> > > > > > > think
> > > > > > > > we
> > > > > > > > > > can
> > > > > > > > > > > > > however make the --abort command compatible with
> > older
> > > > > > > versions.
> > > > > > > > It
> > > > > > > > > > > would
> > > > > > > > > > > > > require a user to do some analysis on their own to
> > > > > identify a
> > > > > > > > > hanging
> > > > > > > > > > > > > transaction, but then they can use the tool from a
> > new
> > > > > > release
> > > > > > > to
> > > > > > > > > > > > recover.
> > > > > > > > > > > > > For example, users could detect a hanging
> transaction
> > > > > through
> > > > > > > the
> > > > > > > > > > > > existing
> > > > > > > > > > > > > "LastStableOffsetLag" metric and then collect the
> > > needed
> > > > > > > > > information
> > > > > > > > > > > > from a
> > > > > > > > > > > > > dump of the log (or producer snapshot). It's more
> > work,
> > > > but
> > > > > > at
> > > > > > > > > least
> > > > > > > > > > > it's
> > > > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Jason
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > > > > > rndgstn@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written
> KIP.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Will the implementation be backported to prior
> > Kafka
> > > > > > > versions?
> > > > > > > > > The
> > > > > > > > > > > > > reason
> > > > > > > > > > > > > > I ask is because if it is not backported and
> > similar
> > > > > > > > > functionality
> > > > > > > > > > is
> > > > > > > > > > > > not
> > > > > > > > > > > > > > otherwise made available for older versions, then
> > the
> > > > > only
> > > > > > > > > recourse
> > > > > > > > > > > > > (aside
> > > > > > > > > > > > > > from deleting and recreating the topic as you
> > pointed
> > > > > out)
> > > > > > > may
> > > > > > > > be
> > > > > > > > > > to
> > > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up
> getting
> > > > this
> > > > > > > > > > > > functionality).
> > > > > > > > > > > > > > Such an upgrade may not be desirable, especially
> if
> > > the
> > > > > > > number
> > > > > > > > of
> > > > > > > > > > > > > > intermediate versions is considerable. I
> understand
> > > the
> > > > > > > mantra
> > > > > > > > of
> > > > > > > > > > > > "never
> > > > > > > > > > > > > > fall too many versions behind" but the reality of
> > it
> > > is
> > > > > > that
> > > > > > > it
> > > > > > > > > > isn't
> > > > > > > > > > > > > > always the case.  Even if the version is
> relatively
> > > > > recent,
> > > > > > > an
> > > > > > > > > > > upgrade
> > > > > > > > > > > > > may
> > > > > > > > > > > > > > still not be possible for some time, and a
> quicker
> > > > > > resolution
> > > > > > > > may
> > > > > > > > > > be
> > > > > > > > > > > > > > necessary.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ron
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I've added a proposal to handle the problem of
> > > > hanging
> > > > > > > > > > > transactions:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > > > .
> > > > > > > > > > > > > > > In theory, this should never happen. In
> practice,
> > > we
> > > > > have
> > > > > > > hit
> > > > > > > > > one
> > > > > > > > > > > bug
> > > > > > > > > > > > > > where
> > > > > > > > > > > > > > > it was possible and there are few good options
> > > today
> > > > to
> > > > > > > > > recover.
> > > > > > > > > > > > Take a
> > > > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > -- Guozhang
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Tom,

Yeah, that's fair. I will update the proposal. I was also thinking of
adding a separate column for duration, just to save users the trouble of
computing it.

Thanks,
Jason

On Wed, Sep 9, 2020 at 1:21 AM Tom Bentley <tb...@redhat.com> wrote:

> Hi Jason,
>
> The KIP looks good to me, but I had one question. AFAIU the LastTimestamp
> column in the output of --describe-producers and --find-hanging is there so
> the users of the tool know the txnLastUpdateTimestamp of the
> TransactionMetadata and from that and the (max) timeout can infer something
> about the likelihood that this really is a stuck transaction. If that's the
> case then what is the benefit in displaying it as a ms offset from the unix
> epoch, rather than an actual date time?
>
> Thanks,
>
> Tom
>
> On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Thanks Jason, I do not have more comments on the KIP then.
> >
> > On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > > Hmm, but the "TxnStartOffset" is not included in the
> DescribeProducers
> > > response either?
> > >
> > > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the schema.
> > > Fixed now!
> > >
> > > -Jason
> > >
> > > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > >
> > > > > Hey Guozhang,
> > > > >
> > > > > Thanks for the detailed comments. Responses inline:
> > > > >
> > > > > > 1. I'd like to clarify how we can make "--abort" work with old
> > > brokers,
> > > > > since without the additional field "Partitions" the tool needs to
> set
> > > the
> > > > > coordinator epoch correctly instead of "-1"? Arguably that's still
> > > doable
> > > > > but would require different call paths, and it's not clear whether
> > > that's
> > > > > worth doing for old versions.
> > > > >
> > > > > That's a good question. What I had in mind was to write the marker
> > > using
> > > > > the last coordinator epoch that was used by the respective
> > ProducerId.
> > > I
> > > > > realized that I left the coordinator epoch out of the
> > > `DescribeProducers`
> > > > > response, so I have updated the KIP to include it. It is possible
> > that
> > > > > there is no coordinator epoch associated with a given ProducerId
> > (e.g.
> > > if
> > > > > it is the first transaction from that producer), but in this case
> we
> > > can
> > > > > use 0.
> > > > >
> > > > > As for whether this is worth doing, I guess I would be more
> inclined
> > to
> > > > > leave it out if users had a reasonable alternative today to address
> > > this
> > > > > problem.
> > > > >
> > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent to
> only
> > > > > leaders
> > > > > while ListTransactions can be sent to any brokers? Or is it really
> > > > > "ListTransactions to be sent to coordinators only"? From the
> workflow
> > > > > you've described, based on the results back from DescribeProducers,
> > we
> > > > > should just immediately send ListTransactions to the
> > > > > corresponding coordinators based on the collected producer ids,
> > instead
> > > > of
> > > > > trying to send to any brokers right?
> > > > >
> > > > > I'm going to change `DescribeProducers` so that it can be handled
> by
> > > any
> > > > > replica of a topic partition. This was suggested by Lucas in order
> to
> > > > allow
> > > > > this API to be used for replica consistency testing. As far as
> > > > > `ListTransactions`, I was treating this similarly to `ListGroups`.
> > > > Although
> > > > > we know that the coordinators are the leaders of the
> > > __transaction_state
> > > > > partitions, this is more of an implementation detail. From an API
> > > > > perspective, we say that any broker could be a transaction
> > coordinator.
> > > > >
> > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > permission
> > > > > on
> > > > > the associated topic sufficient to allow any users to get all
> > producer
> > > > > information writing to the specific topic-partitions including last
> > > > > timestamp, txn-start-timestamp etc, which may be considered
> > sensitive?
> > > > > Should we require "ClusterAction" to only allow operators only?
> > > > >
> > > > > That's a fair point. Do you think `Read` permission would be
> > > reasonable?
> > > > > This is all information that could be obtained by reading the
> topic.
> > > > >
> > > > > Yeah that makes sense.
> > > >
> > > >
> > > > > > 4. From the example it seems "TxnStartOffset" should be included
> in
> > > the
> > > > > DescribeTransaction response schema? Otherwise the user would not
> get
> > > it
> > > > in
> > > > > the following WriteTxnMarker request.
> > > > >
> > > > > The `DescribeTransaction` API is sent to the transaction
> coordinator,
> > > > which
> > > > > does not know the start offset of a transaction in each topic
> > > partition.
> > > > > That is why we need `DescribeProducers`.
> > > > >
> > > >
> > > > Hmm, but the "TxnStartOffset" is not included in the
> DescribeProducers
> > > > response either?
> > > >
> > > >
> > > > >
> > > > > > 5. It is a bit easier for readers to highlight the added fields
> in
> > > the
> > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> adding
> > > > > "Partitions" with the starting offset, right?). Also as for its
> > > response
> > > > it
> > > > > seems we do not make any schema changes except adding one more
> > > potential
> > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case we
> > can
> > > > just
> > > > > state that explicitly.
> > > > >
> > > > > I highlighted the new field in the request. For the response, the
> KIP
> > > > > states the following: "There are no changes to the response schema,
> > but
> > > > it
> > > > > will be bumped. Note that we are also enabling flexible version
> > > support."
> > > > >
> > > > > > 6. It is not clear to me for the overloaded function that the
> > > following
> > > > > option classes are not specified, what should be the default
> options?
> > > > > ...
> > > > >
> > > > > I was just trying to stick with existing conventions, but I will
> add
> > > some
> > > > > more detail here. I think we should probably still include
> > > > > `AbortTransactionOptions`. The `Options` classes are how users
> > override
> > > > > timeouts.
> > > > >
> > > > > > 7.1 Is "--broker" a required or optional (in that case I presume
> we
> > > > would
> > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > >
> > > > > I think it should be required as a reasonable way to limit the
> scope
> > of
> > > > the
> > > > > search. This is meant to be guided by metrics after all. If we do
> not
> > > > limit
> > > > > the scope to a single broker, then the behavior might get worse as
> > the
> > > > > cluster grows. I will clarify this.
> > > > >
> > > > > > 7.2 Seems "list-producers" is not exposed as a standalone feature
> > in
> > > > the
> > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > intentional?
> > > > > Personally I feel exposing a "--list-producers" may be useful too:
> if
> > > we
> > > > > believe the user has the right ACL, it is legitimate to return the
> > > > producer
> > > > > information to her anyways. But that is debatable in the meta point
> > 3)
> > > > > above.
> > > > >
> > > > > Yeah, I was planning to add this to support the use case that Lucas
> > > > > mentioned. There is some awkwardness since it is a little difficult
> > to
> > > > > convey different sources of information through the same command. I
> > > guess
> > > > > we can do `--list producers` and `--list transactions` and explain
> in
> > > the
> > > > > documentation. Maybe that is good enough.
> > > > >
> > > > > > 7.3 "Describing Transactions": we should also explain how that
> > would
> > > be
> > > > > executed, e.g. at least we should clarify that we would first find
> > the
> > > > > coordinator based on the transactional.id and hence users do not
> > need
> > > to
> > > > > specify one.
> > > > >
> > > > > Sure, makes sense.
> > > > >
> > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > "--broker"
> > > > > node
> > > > > as a required option? Otherwise we would not know which broker to
> > send
> > > > to.
> > > > >
> > > > > The --topic and --partition arguments are required, so the target
> is
> > > > always
> > > > > the leader of that partition.
> > > > >
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > > bob.barrett@confluent.io
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > Thanks for this KIP, I think this will be a huge operational
> > > > improvement
> > > > > > and overall it looks great to me.
> > > > > >
> > > > > > I'm not sure how much value the MaxActiveTransactionDuration
> metric
> > > > adds,
> > > > > > given that we have the --find-hanging option in the tool. As you
> > > > mention,
> > > > > > instances of these transactions are expected to be rare, and a
> > > > > > partition-level metric, which can generate a lot of data, seems
> > very
> > > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > > kafka-transactions
> > > > > > --find-hanging on the relevant broker" is a reasonable process
> for
> > > > > cluster
> > > > > > operators to follow.
> > > > > >
> > > > > > Thanks,
> > > > > > Bob
> > > > > >
> > > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <
> wangguoz@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > Thanks for the written KIP. I think this is going to be a very
> > > useful
> > > > > > tool
> > > > > > > for operational improvements since with eos in its current
> stage,
> > > we
> > > > > > cannot
> > > > > > > confidently assert that we are bug-free, and even in the future
> > > when
> > > > we
> > > > > > are
> > > > > > > confident this is still going to be leveraged by older
> versioned
> > > > > brokers.
> > > > > > > Regarding the solution, I've also debated myself whether Kafka
> > > should
> > > > > > > "self-heal" automatically when detected in such situations, or
> > > should
> > > > > we
> > > > > > > instead build into ecosystem tooling to let operators do it.
> And
> > > I've
> > > > > > also
> > > > > > > convinced myself that the latter should be a better solution to
> > > keep
> > > > > > Kafka
> > > > > > > software itself simpler.
> > > > > > >
> > > > > > > Regarding the KIP itself, I have a few meta comments below:
> > > > > > >
> > > > > > > 1. I'd like to clarify how we can make "--abort" work with old
> > > > brokers,
> > > > > > > since without the additional field "Partitions" the tool needs
> to
> > > set
> > > > > the
> > > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> > still
> > > > > doable
> > > > > > > but would require different call paths, and it's not clear
> > whether
> > > > > that's
> > > > > > > worth doing for old versions.
> > > > > > >
> > > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent to
> > only
> > > > > > leaders
> > > > > > > while ListTransactions can be sent to any brokers? Or is it
> > really
> > > > > > > "ListTransactions to be sent to coordinators only"? From the
> > > workflow
> > > > > > > you've described, based on the results back from
> > DescribeProducers,
> > > > we
> > > > > > > should just immediately send ListTransactions to the
> > > > > > > corresponding coordinators based on the collected producer ids,
> > > > instead
> > > > > > of
> > > > > > > trying to send to any brokers right?
> > > > > > >
> > > > > > > Also I'm a bit concerned if "ListTransactions" could
> potentially
> > > > return
> > > > > > too
> > > > > > > much data with "StateFilters" set to all states, including
> > > completed
> > > > > > ones.
> > > > > > > Do we expect users ever want to know transactions that are not
> > > > pending?
> > > > > > On
> > > > > > > the other hand, maybe we can just require users to specify the
> > > > "pids[]"
> > > > > > in
> > > > > > > this request too to further filter those un-interested
> > > transactions.
> > > > > This
> > > > > > > also works well with the workflow: we know exactly from
> > > > > > "DescribeProducers"
> > > > > > > which pids are we diagnosing right now, so in the follow-up
> > > > > > > "ListTransactions" we should also only care for those
> partitions
> > > > only.
> > > > > > >
> > > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > > permission
> > > > > > on
> > > > > > > the associated topic sufficient to allow any users to get all
> > > > producer
> > > > > > > information writing to the specific topic-partitions including
> > last
> > > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > > sensitive?
> > > > > > > Should we require "ClusterAction" to only allow operators only?
> > > > > > >
> > > > > > > Below are more detailed comments:
> > > > > > >
> > > > > > > 4. From the example it seems "TxnStartOffset" should be
> included
> > in
> > > > the
> > > > > > > DescribeTransaction response schema? Otherwise the user would
> not
> > > get
> > > > > it
> > > > > > in
> > > > > > > the following WriteTxnMarker request.
> > > > > > >
> > > > > > > 5. It is a bit easier for readers to highlight the added fields
> > in
> > > > the
> > > > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> > > adding
> > > > > > > "Partitions" with the starting offset, right?). Also as for its
> > > > > response
> > > > > > it
> > > > > > > seems we do not make any schema changes except adding one more
> > > > > potential
> > > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case
> > we
> > > > can
> > > > > > just
> > > > > > > state that explicitly.
> > > > > > >
> > > > > > > 6. It is not clear to me for the overloaded function that the
> > > > following
> > > > > > > option classes are not specified, what should be the default
> > > options?
> > > > > > >
> > > > > > > * ListTransactionsOptions
> > > > > > > * DescribeTransactionsOptions
> > > > > > > * DescribeProducersOptions
> > > > > > >
> > > > > > > Also, it seems AbortTransactionOptions would just be empty? If
> > yes
> > > do
> > > > > we
> > > > > > > really need this option class for now?
> > > > > > >
> > > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > > 7.1 Is "--broker" a required or optional (in that case I
> presume
> > we
> > > > > would
> > > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > > 7.2 Seems "list-producers" is not exposed as a standalone
> feature
> > > in
> > > > > the
> > > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > > intentional?
> > > > > > > Personally I feel exposing a "--list-producers" may be useful
> > too:
> > > if
> > > > > we
> > > > > > > believe the user has the right ACL, it is legitimate to return
> > the
> > > > > > producer
> > > > > > > information to her anyways. But that is debatable in the meta
> > point
> > > > 3)
> > > > > > > above.
> > > > > > > 7.3 "Describing Transactions": we should also explain how that
> > > would
> > > > be
> > > > > > > executed, e.g. at least we should clarify that we would first
> > find
> > > > the
> > > > > > > coordinator based on the transactional.id and hence users do
> not
> > > > need
> > > > > to
> > > > > > > specify one.
> > > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > > "--broker"
> > > > > > node
> > > > > > > as a required option? Otherwise we would not know which broker
> to
> > > > send
> > > > > > to.
> > > > > > >
> > > > > > >
> > > > > > > Overall, nice written one, thanks Jason.
> > > > > > >
> > > > > > > Guozhang
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > > lucas@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > >> Would it be worth returning
> transactional.id.expiration.ms
> > in
> > > > the
> > > > > > > > DescribeProducersResponse?
> > > > > > > >
> > > > > > > > > That's an interesting thought as well. Are you trying to
> > avoid
> > > > the
> > > > > > need
> > > > > > > > to
> > > > > > > > specify it through the command line? The tool could also
> query
> > > the
> > > > > > value
> > > > > > > > with DescribeConfigs I suppose.
> > > > > > > >
> > > > > > > > Basically. I'm not sure how useful this will be in practice,
> > > though
> > > > > it
> > > > > > > > might help when debugging.
> > > > > > > >
> > > > > > > > Lucas
> > > > > > > >
> > > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > > jason@confluent.io
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Lucas,
> > > > > > > > >
> > > > > > > > > Thanks for the comments. Responses below:
> > > > > > > > >
> > > > > > > > > > Given that it's possible for replica producer states to
> > > diverge
> > > > > > from
> > > > > > > > each
> > > > > > > > > other, it would be very useful if
> > > > > DescribeProducers(Request,Response)
> > > > > > > and
> > > > > > > > > tooling is able to query all partition replicas for their
> > > > producers
> > > > > > > > >
> > > > > > > > > Yes, it makes sense to me to let DescribeProducers work on
> > both
> > > > > > > followers
> > > > > > > > > and leaders. In fact, I'm encouraged that there are use
> cases
> > > for
> > > > > > this
> > > > > > > > work
> > > > > > > > > other than detecting hanging transactions. That was indeed
> > the
> > > > > hope,
> > > > > > > but
> > > > > > > > I
> > > > > > > > > didn't have anything specific in mind. I will update the
> > > > proposal.
> > > > > > > > >
> > > > > > > > > > Would it be worth returning
> transactional.id.expiration.ms
> > > in
> > > > > the
> > > > > > > > > DescribeProducersResponse?
> > > > > > > > >
> > > > > > > > > That's an interesting thought as well. Are you trying to
> > avoid
> > > > the
> > > > > > need
> > > > > > > > to
> > > > > > > > > specify it through the command line? The tool could also
> > query
> > > > the
> > > > > > > value
> > > > > > > > > with DescribeConfigs I suppose.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jason
> > > > > > > > >
> > > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > > lucas@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > This looks like a very useful tool, thanks for writing it
> > up.
> > > > > > > > > >
> > > > > > > > > > Given that it's possible for replica producer states to
> > > diverge
> > > > > > from
> > > > > > > > each
> > > > > > > > > > other, it would be very useful if
> > > > > > DescribeProducers(Request,Response)
> > > > > > > > and
> > > > > > > > > > tooling is able to query all partition replicas for their
> > > > > > producers.
> > > > > > > > One
> > > > > > > > > > way I can see this being used immediately is in kafka's
> > > system
> > > > > > tests,
> > > > > > > > > > especially the ones that inject failures. At the end of
> the
> > > > test
> > > > > we
> > > > > > > can
> > > > > > > > > > query all replicas and make sure that their states have
> not
> > > > > > > diverged. I
> > > > > > > > > can
> > > > > > > > > > also see it being useful when debugging production
> clusters
> > > > too.
> > > > > > > > > >
> > > > > > > > > > Would it be worth returning
> transactional.id.expiration.ms
> > > in
> > > > > the
> > > > > > > > > > DescribeProducersResponse?
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > >
> > > > > > > > > > Lucas
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > > rndgstn@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > > > > > > > >
> > > > > > > > > > > Ron
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > > jason@confluent.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hey Ron,
> > > > > > > > > > > >
> > > > > > > > > > > > We do not typically backport new APIs to older
> > versions.
> > > I
> > > > > > think
> > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > > however make the --abort command compatible with
> older
> > > > > > versions.
> > > > > > > It
> > > > > > > > > > would
> > > > > > > > > > > > require a user to do some analysis on their own to
> > > > identify a
> > > > > > > > hanging
> > > > > > > > > > > > transaction, but then they can use the tool from a
> new
> > > > > release
> > > > > > to
> > > > > > > > > > > recover.
> > > > > > > > > > > > For example, users could detect a hanging transaction
> > > > through
> > > > > > the
> > > > > > > > > > > existing
> > > > > > > > > > > > "LastStableOffsetLag" metric and then collect the
> > needed
> > > > > > > > information
> > > > > > > > > > > from a
> > > > > > > > > > > > dump of the log (or producer snapshot). It's more
> work,
> > > but
> > > > > at
> > > > > > > > least
> > > > > > > > > > it's
> > > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jason
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > > > > rndgstn@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Will the implementation be backported to prior
> Kafka
> > > > > > versions?
> > > > > > > > The
> > > > > > > > > > > > reason
> > > > > > > > > > > > > I ask is because if it is not backported and
> similar
> > > > > > > > functionality
> > > > > > > > > is
> > > > > > > > > > > not
> > > > > > > > > > > > > otherwise made available for older versions, then
> the
> > > > only
> > > > > > > > recourse
> > > > > > > > > > > > (aside
> > > > > > > > > > > > > from deleting and recreating the topic as you
> pointed
> > > > out)
> > > > > > may
> > > > > > > be
> > > > > > > > > to
> > > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up getting
> > > this
> > > > > > > > > > > functionality).
> > > > > > > > > > > > > Such an upgrade may not be desirable, especially if
> > the
> > > > > > number
> > > > > > > of
> > > > > > > > > > > > > intermediate versions is considerable. I understand
> > the
> > > > > > mantra
> > > > > > > of
> > > > > > > > > > > "never
> > > > > > > > > > > > > fall too many versions behind" but the reality of
> it
> > is
> > > > > that
> > > > > > it
> > > > > > > > > isn't
> > > > > > > > > > > > > always the case.  Even if the version is relatively
> > > > recent,
> > > > > > an
> > > > > > > > > > upgrade
> > > > > > > > > > > > may
> > > > > > > > > > > > > still not be possible for some time, and a quicker
> > > > > resolution
> > > > > > > may
> > > > > > > > > be
> > > > > > > > > > > > > necessary.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ron
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > > > > > > jason@confluent.io>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've added a proposal to handle the problem of
> > > hanging
> > > > > > > > > > transactions:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > > .
> > > > > > > > > > > > > > In theory, this should never happen. In practice,
> > we
> > > > have
> > > > > > hit
> > > > > > > > one
> > > > > > > > > > bug
> > > > > > > > > > > > > where
> > > > > > > > > > > > > > it was possible and there are few good options
> > today
> > > to
> > > > > > > > recover.
> > > > > > > > > > > Take a
> > > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Jason
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > -- Guozhang
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Tom Bentley <tb...@redhat.com>.
Hi Jason,

The KIP looks good to me, but I had one question. AFAIU the LastTimestamp
column in the output of --describe-producers and --find-hanging is there so
the users of the tool know the txnLastUpdateTimestamp of the
TransactionMetadata and from that and the (max) timeout can infer something
about the likelihood that this really is a stuck transaction. If that's the
case then what is the benefit in displaying it as a ms offset from the unix
epoch, rather than an actual date time?

Thanks,

Tom

On Mon, Aug 31, 2020 at 11:28 PM Guozhang Wang <wa...@gmail.com> wrote:

> Thanks Jason, I do not have more comments on the KIP then.
>
> On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > > Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
> > response either?
> >
> > Oh, I accidentally called it `CurrentTxnStartTimestamp` in the schema.
> > Fixed now!
> >
> > -Jason
> >
> > On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Guozhang,
> > > >
> > > > Thanks for the detailed comments. Responses inline:
> > > >
> > > > > 1. I'd like to clarify how we can make "--abort" work with old
> > brokers,
> > > > since without the additional field "Partitions" the tool needs to set
> > the
> > > > coordinator epoch correctly instead of "-1"? Arguably that's still
> > doable
> > > > but would require different call paths, and it's not clear whether
> > that's
> > > > worth doing for old versions.
> > > >
> > > > That's a good question. What I had in mind was to write the marker
> > using
> > > > the last coordinator epoch that was used by the respective
> ProducerId.
> > I
> > > > realized that I left the coordinator epoch out of the
> > `DescribeProducers`
> > > > response, so I have updated the KIP to include it. It is possible
> that
> > > > there is no coordinator epoch associated with a given ProducerId
> (e.g.
> > if
> > > > it is the first transaction from that producer), but in this case we
> > can
> > > > use 0.
> > > >
> > > > As for whether this is worth doing, I guess I would be more inclined
> to
> > > > leave it out if users had a reasonable alternative today to address
> > this
> > > > problem.
> > > >
> > > > > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> > > > leaders
> > > > while ListTransactions can be sent to any brokers? Or is it really
> > > > "ListTransactions to be sent to coordinators only"? From the workflow
> > > > you've described, based on the results back from DescribeProducers,
> we
> > > > should just immediately send ListTransactions to the
> > > > corresponding coordinators based on the collected producer ids,
> instead
> > > of
> > > > trying to send to any brokers right?
> > > >
> > > > I'm going to change `DescribeProducers` so that it can be handled by
> > any
> > > > replica of a topic partition. This was suggested by Lucas in order to
> > > allow
> > > > this API to be used for replica consistency testing. As far as
> > > > `ListTransactions`, I was treating this similarly to `ListGroups`.
> > > Although
> > > > we know that the coordinators are the leaders of the
> > __transaction_state
> > > > partitions, this is more of an implementation detail. From an API
> > > > perspective, we say that any broker could be a transaction
> coordinator.
> > > >
> > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > permission
> > > > on
> > > > the associated topic sufficient to allow any users to get all
> producer
> > > > information writing to the specific topic-partitions including last
> > > > timestamp, txn-start-timestamp etc, which may be considered
> sensitive?
> > > > Should we require "ClusterAction" to only allow operators only?
> > > >
> > > > That's a fair point. Do you think `Read` permission would be
> > reasonable?
> > > > This is all information that could be obtained by reading the topic.
> > > >
> > > > Yeah that makes sense.
> > >
> > >
> > > > > 4. From the example it seems "TxnStartOffset" should be included in
> > the
> > > > DescribeTransaction response schema? Otherwise the user would not get
> > it
> > > in
> > > > the following WriteTxnMarker request.
> > > >
> > > > The `DescribeTransaction` API is sent to the transaction coordinator,
> > > which
> > > > does not know the start offset of a transaction in each topic
> > partition.
> > > > That is why we need `DescribeProducers`.
> > > >
> > >
> > > Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
> > > response either?
> > >
> > >
> > > >
> > > > > 5. It is a bit easier for readers to highlight the added fields in
> > the
> > > > existing WriteTxnMarkerRequest (btw I read is that we are only adding
> > > > "Partitions" with the starting offset, right?). Also as for its
> > response
> > > it
> > > > seems we do not make any schema changes except adding one more
> > potential
> > > > error code "INVALID_TXN_STATE" to it, right? If that's the case we
> can
> > > just
> > > > state that explicitly.
> > > >
> > > > I highlighted the new field in the request. For the response, the KIP
> > > > states the following: "There are no changes to the response schema,
> but
> > > it
> > > > will be bumped. Note that we are also enabling flexible version
> > support."
> > > >
> > > > > 6. It is not clear to me for the overloaded function that the
> > following
> > > > option classes are not specified, what should be the default options?
> > > > ...
> > > >
> > > > I was just trying to stick with existing conventions, but I will add
> > some
> > > > more detail here. I think we should probably still include
> > > > `AbortTransactionOptions`. The `Options` classes are how users
> override
> > > > timeouts.
> > > >
> > > > > 7.1 Is "--broker" a required or optional (in that case I presume we
> > > would
> > > > just query all brokers iteratively) in "--find-hanging"?
> > > >
> > > > I think it should be required as a reasonable way to limit the scope
> of
> > > the
> > > > search. This is meant to be guided by metrics after all. If we do not
> > > limit
> > > > the scope to a single broker, then the behavior might get worse as
> the
> > > > cluster grows. I will clarify this.
> > > >
> > > > > 7.2 Seems "list-producers" is not exposed as a standalone feature
> in
> > > the
> > > > cmd but only used in the wrapping "--find-hanging", is that
> > intentional?
> > > > Personally I feel exposing a "--list-producers" may be useful too: if
> > we
> > > > believe the user has the right ACL, it is legitimate to return the
> > > producer
> > > > information to her anyways. But that is debatable in the meta point
> 3)
> > > > above.
> > > >
> > > > Yeah, I was planning to add this to support the use case that Lucas
> > > > mentioned. There is some awkwardness since it is a little difficult
> to
> > > > convey different sources of information through the same command. I
> > guess
> > > > we can do `--list producers` and `--list transactions` and explain in
> > the
> > > > documentation. Maybe that is good enough.
> > > >
> > > > > 7.3 "Describing Transactions": we should also explain how that
> would
> > be
> > > > executed, e.g. at least we should clarify that we would first find
> the
> > > > coordinator based on the transactional.id and hence users do not
> need
> > to
> > > > specify one.
> > > >
> > > > Sure, makes sense.
> > > >
> > > > > 7.4. In "Aborting Transactions", should we also specify the
> > "--broker"
> > > > node
> > > > as a required option? Otherwise we would not know which broker to
> send
> > > to.
> > > >
> > > > The --topic and --partition arguments are required, so the target is
> > > always
> > > > the leader of that partition.
> > > >
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > >
> > > >
> > > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> > bob.barrett@confluent.io
> > > >
> > > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > Thanks for this KIP, I think this will be a huge operational
> > > improvement
> > > > > and overall it looks great to me.
> > > > >
> > > > > I'm not sure how much value the MaxActiveTransactionDuration metric
> > > adds,
> > > > > given that we have the --find-hanging option in the tool. As you
> > > mention,
> > > > > instances of these transactions are expected to be rare, and a
> > > > > partition-level metric, which can generate a lot of data, seems
> very
> > > > > heavyweight for such a rare occurrence. I think "alert on
> > > > > PartitionsWithLateTransactionsCount" followed by "run
> > > kafka-transactions
> > > > > --find-hanging on the relevant broker" is a reasonable process for
> > > > cluster
> > > > > operators to follow.
> > > > >
> > > > > Thanks,
> > > > > Bob
> > > > >
> > > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > Thanks for the written KIP. I think this is going to be a very
> > useful
> > > > > tool
> > > > > > for operational improvements since with eos in its current stage,
> > we
> > > > > cannot
> > > > > > confidently assert that we are bug-free, and even in the future
> > when
> > > we
> > > > > are
> > > > > > confident this is still going to be leveraged by older versioned
> > > > brokers.
> > > > > > Regarding the solution, I've also debated myself whether Kafka
> > should
> > > > > > "self-heal" automatically when detected in such situations, or
> > should
> > > > we
> > > > > > instead build into ecosystem tooling to let operators do it. And
> > I've
> > > > > also
> > > > > > convinced myself that the latter should be a better solution to
> > keep
> > > > > Kafka
> > > > > > software itself simpler.
> > > > > >
> > > > > > Regarding the KIP itself, I have a few meta comments below:
> > > > > >
> > > > > > 1. I'd like to clarify how we can make "--abort" work with old
> > > brokers,
> > > > > > since without the additional field "Partitions" the tool needs to
> > set
> > > > the
> > > > > > coordinator epoch correctly instead of "-1"? Arguably that's
> still
> > > > doable
> > > > > > but would require different call paths, and it's not clear
> whether
> > > > that's
> > > > > > worth doing for old versions.
> > > > > >
> > > > > > 2. Why do we have to enforce "DescribeProducers" to be sent to
> only
> > > > > leaders
> > > > > > while ListTransactions can be sent to any brokers? Or is it
> really
> > > > > > "ListTransactions to be sent to coordinators only"? From the
> > workflow
> > > > > > you've described, based on the results back from
> DescribeProducers,
> > > we
> > > > > > should just immediately send ListTransactions to the
> > > > > > corresponding coordinators based on the collected producer ids,
> > > instead
> > > > > of
> > > > > > trying to send to any brokers right?
> > > > > >
> > > > > > Also I'm a bit concerned if "ListTransactions" could potentially
> > > return
> > > > > too
> > > > > > much data with "StateFilters" set to all states, including
> > completed
> > > > > ones.
> > > > > > Do we expect users ever want to know transactions that are not
> > > pending?
> > > > > On
> > > > > > the other hand, maybe we can just require users to specify the
> > > "pids[]"
> > > > > in
> > > > > > this request too to further filter those un-interested
> > transactions.
> > > > This
> > > > > > also works well with the workflow: we know exactly from
> > > > > "DescribeProducers"
> > > > > > which pids are we diagnosing right now, so in the follow-up
> > > > > > "ListTransactions" we should also only care for those partitions
> > > only.
> > > > > >
> > > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > > permission
> > > > > on
> > > > > > the associated topic sufficient to allow any users to get all
> > > producer
> > > > > > information writing to the specific topic-partitions including
> last
> > > > > > timestamp, txn-start-timestamp etc, which may be considered
> > > sensitive?
> > > > > > Should we require "ClusterAction" to only allow operators only?
> > > > > >
> > > > > > Below are more detailed comments:
> > > > > >
> > > > > > 4. From the example it seems "TxnStartOffset" should be included
> in
> > > the
> > > > > > DescribeTransaction response schema? Otherwise the user would not
> > get
> > > > it
> > > > > in
> > > > > > the following WriteTxnMarker request.
> > > > > >
> > > > > > 5. It is a bit easier for readers to highlight the added fields
> in
> > > the
> > > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> > adding
> > > > > > "Partitions" with the starting offset, right?). Also as for its
> > > > response
> > > > > it
> > > > > > seems we do not make any schema changes except adding one more
> > > > potential
> > > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case
> we
> > > can
> > > > > just
> > > > > > state that explicitly.
> > > > > >
> > > > > > 6. It is not clear to me for the overloaded function that the
> > > following
> > > > > > option classes are not specified, what should be the default
> > options?
> > > > > >
> > > > > > * ListTransactionsOptions
> > > > > > * DescribeTransactionsOptions
> > > > > > * DescribeProducersOptions
> > > > > >
> > > > > > Also, it seems AbortTransactionOptions would just be empty? If
> yes
> > do
> > > > we
> > > > > > really need this option class for now?
> > > > > >
> > > > > > 7. A couple questions from the cmd tool examples:
> > > > > > 7.1 Is "--broker" a required or optional (in that case I presume
> we
> > > > would
> > > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > > 7.2 Seems "list-producers" is not exposed as a standalone feature
> > in
> > > > the
> > > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > > intentional?
> > > > > > Personally I feel exposing a "--list-producers" may be useful
> too:
> > if
> > > > we
> > > > > > believe the user has the right ACL, it is legitimate to return
> the
> > > > > producer
> > > > > > information to her anyways. But that is debatable in the meta
> point
> > > 3)
> > > > > > above.
> > > > > > 7.3 "Describing Transactions": we should also explain how that
> > would
> > > be
> > > > > > executed, e.g. at least we should clarify that we would first
> find
> > > the
> > > > > > coordinator based on the transactional.id and hence users do not
> > > need
> > > > to
> > > > > > specify one.
> > > > > > 7.4. In "Aborting Transactions", should we also specify the
> > > "--broker"
> > > > > node
> > > > > > as a required option? Otherwise we would not know which broker to
> > > send
> > > > > to.
> > > > > >
> > > > > >
> > > > > > Overall, nice written one, thanks Jason.
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > > lucas@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > >> Would it be worth returning transactional.id.expiration.ms
> in
> > > the
> > > > > > > DescribeProducersResponse?
> > > > > > >
> > > > > > > > That's an interesting thought as well. Are you trying to
> avoid
> > > the
> > > > > need
> > > > > > > to
> > > > > > > specify it through the command line? The tool could also query
> > the
> > > > > value
> > > > > > > with DescribeConfigs I suppose.
> > > > > > >
> > > > > > > Basically. I'm not sure how useful this will be in practice,
> > though
> > > > it
> > > > > > > might help when debugging.
> > > > > > >
> > > > > > > Lucas
> > > > > > >
> > > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > > jason@confluent.io
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Lucas,
> > > > > > > >
> > > > > > > > Thanks for the comments. Responses below:
> > > > > > > >
> > > > > > > > > Given that it's possible for replica producer states to
> > diverge
> > > > > from
> > > > > > > each
> > > > > > > > other, it would be very useful if
> > > > DescribeProducers(Request,Response)
> > > > > > and
> > > > > > > > tooling is able to query all partition replicas for their
> > > producers
> > > > > > > >
> > > > > > > > Yes, it makes sense to me to let DescribeProducers work on
> both
> > > > > > followers
> > > > > > > > and leaders. In fact, I'm encouraged that there are use cases
> > for
> > > > > this
> > > > > > > work
> > > > > > > > other than detecting hanging transactions. That was indeed
> the
> > > > hope,
> > > > > > but
> > > > > > > I
> > > > > > > > didn't have anything specific in mind. I will update the
> > > proposal.
> > > > > > > >
> > > > > > > > > Would it be worth returning transactional.id.expiration.ms
> > in
> > > > the
> > > > > > > > DescribeProducersResponse?
> > > > > > > >
> > > > > > > > That's an interesting thought as well. Are you trying to
> avoid
> > > the
> > > > > need
> > > > > > > to
> > > > > > > > specify it through the command line? The tool could also
> query
> > > the
> > > > > > value
> > > > > > > > with DescribeConfigs I suppose.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > > lucas@confluent.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > This looks like a very useful tool, thanks for writing it
> up.
> > > > > > > > >
> > > > > > > > > Given that it's possible for replica producer states to
> > diverge
> > > > > from
> > > > > > > each
> > > > > > > > > other, it would be very useful if
> > > > > DescribeProducers(Request,Response)
> > > > > > > and
> > > > > > > > > tooling is able to query all partition replicas for their
> > > > > producers.
> > > > > > > One
> > > > > > > > > way I can see this being used immediately is in kafka's
> > system
> > > > > tests,
> > > > > > > > > especially the ones that inject failures. At the end of the
> > > test
> > > > we
> > > > > > can
> > > > > > > > > query all replicas and make sure that their states have not
> > > > > > diverged. I
> > > > > > > > can
> > > > > > > > > also see it being useful when debugging production clusters
> > > too.
> > > > > > > > >
> > > > > > > > > Would it be worth returning transactional.id.expiration.ms
> > in
> > > > the
> > > > > > > > > DescribeProducersResponse?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Lucas
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > > rndgstn@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > > > > > > >
> > > > > > > > > > Ron
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > > jason@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hey Ron,
> > > > > > > > > > >
> > > > > > > > > > > We do not typically backport new APIs to older
> versions.
> > I
> > > > > think
> > > > > > we
> > > > > > > > can
> > > > > > > > > > > however make the --abort command compatible with older
> > > > > versions.
> > > > > > It
> > > > > > > > > would
> > > > > > > > > > > require a user to do some analysis on their own to
> > > identify a
> > > > > > > hanging
> > > > > > > > > > > transaction, but then they can use the tool from a new
> > > > release
> > > > > to
> > > > > > > > > > recover.
> > > > > > > > > > > For example, users could detect a hanging transaction
> > > through
> > > > > the
> > > > > > > > > > existing
> > > > > > > > > > > "LastStableOffsetLag" metric and then collect the
> needed
> > > > > > > information
> > > > > > > > > > from a
> > > > > > > > > > > dump of the log (or producer snapshot). It's more work,
> > but
> > > > at
> > > > > > > least
> > > > > > > > > it's
> > > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > > > rndgstn@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > Will the implementation be backported to prior Kafka
> > > > > versions?
> > > > > > > The
> > > > > > > > > > > reason
> > > > > > > > > > > > I ask is because if it is not backported and similar
> > > > > > > functionality
> > > > > > > > is
> > > > > > > > > > not
> > > > > > > > > > > > otherwise made available for older versions, then the
> > > only
> > > > > > > recourse
> > > > > > > > > > > (aside
> > > > > > > > > > > > from deleting and recreating the topic as you pointed
> > > out)
> > > > > may
> > > > > > be
> > > > > > > > to
> > > > > > > > > > > > upgrade to 2.7 (or whatever version ends up getting
> > this
> > > > > > > > > > functionality).
> > > > > > > > > > > > Such an upgrade may not be desirable, especially if
> the
> > > > > number
> > > > > > of
> > > > > > > > > > > > intermediate versions is considerable. I understand
> the
> > > > > mantra
> > > > > > of
> > > > > > > > > > "never
> > > > > > > > > > > > fall too many versions behind" but the reality of it
> is
> > > > that
> > > > > it
> > > > > > > > isn't
> > > > > > > > > > > > always the case.  Even if the version is relatively
> > > recent,
> > > > > an
> > > > > > > > > upgrade
> > > > > > > > > > > may
> > > > > > > > > > > > still not be possible for some time, and a quicker
> > > > resolution
> > > > > > may
> > > > > > > > be
> > > > > > > > > > > > necessary.
> > > > > > > > > > > >
> > > > > > > > > > > > Ron
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > > > > > jason@confluent.io>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi All,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I've added a proposal to handle the problem of
> > hanging
> > > > > > > > > transactions:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > > .
> > > > > > > > > > > > > In theory, this should never happen. In practice,
> we
> > > have
> > > > > hit
> > > > > > > one
> > > > > > > > > bug
> > > > > > > > > > > > where
> > > > > > > > > > > > > it was possible and there are few good options
> today
> > to
> > > > > > > recover.
> > > > > > > > > > Take a
> > > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Jason
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > -- Guozhang
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Guozhang Wang <wa...@gmail.com>.
Thanks Jason, I do not have more comments on the KIP then.

On Mon, Aug 31, 2020 at 3:19 PM Jason Gustafson <ja...@confluent.io> wrote:

> > Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
> response either?
>
> Oh, I accidentally called it `CurrentTxnStartTimestamp` in the schema.
> Fixed now!
>
> -Jason
>
> On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Guozhang,
> > >
> > > Thanks for the detailed comments. Responses inline:
> > >
> > > > 1. I'd like to clarify how we can make "--abort" work with old
> brokers,
> > > since without the additional field "Partitions" the tool needs to set
> the
> > > coordinator epoch correctly instead of "-1"? Arguably that's still
> doable
> > > but would require different call paths, and it's not clear whether
> that's
> > > worth doing for old versions.
> > >
> > > That's a good question. What I had in mind was to write the marker
> using
> > > the last coordinator epoch that was used by the respective ProducerId.
> I
> > > realized that I left the coordinator epoch out of the
> `DescribeProducers`
> > > response, so I have updated the KIP to include it. It is possible that
> > > there is no coordinator epoch associated with a given ProducerId (e.g.
> if
> > > it is the first transaction from that producer), but in this case we
> can
> > > use 0.
> > >
> > > As for whether this is worth doing, I guess I would be more inclined to
> > > leave it out if users had a reasonable alternative today to address
> this
> > > problem.
> > >
> > > > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> > > leaders
> > > while ListTransactions can be sent to any brokers? Or is it really
> > > "ListTransactions to be sent to coordinators only"? From the workflow
> > > you've described, based on the results back from DescribeProducers, we
> > > should just immediately send ListTransactions to the
> > > corresponding coordinators based on the collected producer ids, instead
> > of
> > > trying to send to any brokers right?
> > >
> > > I'm going to change `DescribeProducers` so that it can be handled by
> any
> > > replica of a topic partition. This was suggested by Lucas in order to
> > allow
> > > this API to be used for replica consistency testing. As far as
> > > `ListTransactions`, I was treating this similarly to `ListGroups`.
> > Although
> > > we know that the coordinators are the leaders of the
> __transaction_state
> > > partitions, this is more of an implementation detail. From an API
> > > perspective, we say that any broker could be a transaction coordinator.
> > >
> > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> permission
> > > on
> > > the associated topic sufficient to allow any users to get all producer
> > > information writing to the specific topic-partitions including last
> > > timestamp, txn-start-timestamp etc, which may be considered sensitive?
> > > Should we require "ClusterAction" to only allow operators only?
> > >
> > > That's a fair point. Do you think `Read` permission would be
> reasonable?
> > > This is all information that could be obtained by reading the topic.
> > >
> > > Yeah that makes sense.
> >
> >
> > > > 4. From the example it seems "TxnStartOffset" should be included in
> the
> > > DescribeTransaction response schema? Otherwise the user would not get
> it
> > in
> > > the following WriteTxnMarker request.
> > >
> > > The `DescribeTransaction` API is sent to the transaction coordinator,
> > which
> > > does not know the start offset of a transaction in each topic
> partition.
> > > That is why we need `DescribeProducers`.
> > >
> >
> > Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
> > response either?
> >
> >
> > >
> > > > 5. It is a bit easier for readers to highlight the added fields in
> the
> > > existing WriteTxnMarkerRequest (btw I read is that we are only adding
> > > "Partitions" with the starting offset, right?). Also as for its
> response
> > it
> > > seems we do not make any schema changes except adding one more
> potential
> > > error code "INVALID_TXN_STATE" to it, right? If that's the case we can
> > just
> > > state that explicitly.
> > >
> > > I highlighted the new field in the request. For the response, the KIP
> > > states the following: "There are no changes to the response schema, but
> > it
> > > will be bumped. Note that we are also enabling flexible version
> support."
> > >
> > > > 6. It is not clear to me for the overloaded function that the
> following
> > > option classes are not specified, what should be the default options?
> > > ...
> > >
> > > I was just trying to stick with existing conventions, but I will add
> some
> > > more detail here. I think we should probably still include
> > > `AbortTransactionOptions`. The `Options` classes are how users override
> > > timeouts.
> > >
> > > > 7.1 Is "--broker" a required or optional (in that case I presume we
> > would
> > > just query all brokers iteratively) in "--find-hanging"?
> > >
> > > I think it should be required as a reasonable way to limit the scope of
> > the
> > > search. This is meant to be guided by metrics after all. If we do not
> > limit
> > > the scope to a single broker, then the behavior might get worse as the
> > > cluster grows. I will clarify this.
> > >
> > > > 7.2 Seems "list-producers" is not exposed as a standalone feature in
> > the
> > > cmd but only used in the wrapping "--find-hanging", is that
> intentional?
> > > Personally I feel exposing a "--list-producers" may be useful too: if
> we
> > > believe the user has the right ACL, it is legitimate to return the
> > producer
> > > information to her anyways. But that is debatable in the meta point 3)
> > > above.
> > >
> > > Yeah, I was planning to add this to support the use case that Lucas
> > > mentioned. There is some awkwardness since it is a little difficult to
> > > convey different sources of information through the same command. I
> guess
> > > we can do `--list producers` and `--list transactions` and explain in
> the
> > > documentation. Maybe that is good enough.
> > >
> > > > 7.3 "Describing Transactions": we should also explain how that would
> be
> > > executed, e.g. at least we should clarify that we would first find the
> > > coordinator based on the transactional.id and hence users do not need
> to
> > > specify one.
> > >
> > > Sure, makes sense.
> > >
> > > > 7.4. In "Aborting Transactions", should we also specify the
> "--broker"
> > > node
> > > as a required option? Otherwise we would not know which broker to send
> > to.
> > >
> > > The --topic and --partition arguments are required, so the target is
> > always
> > > the leader of that partition.
> > >
> > >
> > > Thanks,
> > > Jason
> > >
> > >
> > >
> > > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <
> bob.barrett@confluent.io
> > >
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > Thanks for this KIP, I think this will be a huge operational
> > improvement
> > > > and overall it looks great to me.
> > > >
> > > > I'm not sure how much value the MaxActiveTransactionDuration metric
> > adds,
> > > > given that we have the --find-hanging option in the tool. As you
> > mention,
> > > > instances of these transactions are expected to be rare, and a
> > > > partition-level metric, which can generate a lot of data, seems very
> > > > heavyweight for such a rare occurrence. I think "alert on
> > > > PartitionsWithLateTransactionsCount" followed by "run
> > kafka-transactions
> > > > --find-hanging on the relevant broker" is a reasonable process for
> > > cluster
> > > > operators to follow.
> > > >
> > > > Thanks,
> > > > Bob
> > > >
> > > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > Thanks for the written KIP. I think this is going to be a very
> useful
> > > > tool
> > > > > for operational improvements since with eos in its current stage,
> we
> > > > cannot
> > > > > confidently assert that we are bug-free, and even in the future
> when
> > we
> > > > are
> > > > > confident this is still going to be leveraged by older versioned
> > > brokers.
> > > > > Regarding the solution, I've also debated myself whether Kafka
> should
> > > > > "self-heal" automatically when detected in such situations, or
> should
> > > we
> > > > > instead build into ecosystem tooling to let operators do it. And
> I've
> > > > also
> > > > > convinced myself that the latter should be a better solution to
> keep
> > > > Kafka
> > > > > software itself simpler.
> > > > >
> > > > > Regarding the KIP itself, I have a few meta comments below:
> > > > >
> > > > > 1. I'd like to clarify how we can make "--abort" work with old
> > brokers,
> > > > > since without the additional field "Partitions" the tool needs to
> set
> > > the
> > > > > coordinator epoch correctly instead of "-1"? Arguably that's still
> > > doable
> > > > > but would require different call paths, and it's not clear whether
> > > that's
> > > > > worth doing for old versions.
> > > > >
> > > > > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> > > > leaders
> > > > > while ListTransactions can be sent to any brokers? Or is it really
> > > > > "ListTransactions to be sent to coordinators only"? From the
> workflow
> > > > > you've described, based on the results back from DescribeProducers,
> > we
> > > > > should just immediately send ListTransactions to the
> > > > > corresponding coordinators based on the collected producer ids,
> > instead
> > > > of
> > > > > trying to send to any brokers right?
> > > > >
> > > > > Also I'm a bit concerned if "ListTransactions" could potentially
> > return
> > > > too
> > > > > much data with "StateFilters" set to all states, including
> completed
> > > > ones.
> > > > > Do we expect users ever want to know transactions that are not
> > pending?
> > > > On
> > > > > the other hand, maybe we can just require users to specify the
> > "pids[]"
> > > > in
> > > > > this request too to further filter those un-interested
> transactions.
> > > This
> > > > > also works well with the workflow: we know exactly from
> > > > "DescribeProducers"
> > > > > which pids are we diagnosing right now, so in the follow-up
> > > > > "ListTransactions" we should also only care for those partitions
> > only.
> > > > >
> > > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> > permission
> > > > on
> > > > > the associated topic sufficient to allow any users to get all
> > producer
> > > > > information writing to the specific topic-partitions including last
> > > > > timestamp, txn-start-timestamp etc, which may be considered
> > sensitive?
> > > > > Should we require "ClusterAction" to only allow operators only?
> > > > >
> > > > > Below are more detailed comments:
> > > > >
> > > > > 4. From the example it seems "TxnStartOffset" should be included in
> > the
> > > > > DescribeTransaction response schema? Otherwise the user would not
> get
> > > it
> > > > in
> > > > > the following WriteTxnMarker request.
> > > > >
> > > > > 5. It is a bit easier for readers to highlight the added fields in
> > the
> > > > > existing WriteTxnMarkerRequest (btw I read is that we are only
> adding
> > > > > "Partitions" with the starting offset, right?). Also as for its
> > > response
> > > > it
> > > > > seems we do not make any schema changes except adding one more
> > > potential
> > > > > error code "INVALID_TXN_STATE" to it, right? If that's the case we
> > can
> > > > just
> > > > > state that explicitly.
> > > > >
> > > > > 6. It is not clear to me for the overloaded function that the
> > following
> > > > > option classes are not specified, what should be the default
> options?
> > > > >
> > > > > * ListTransactionsOptions
> > > > > * DescribeTransactionsOptions
> > > > > * DescribeProducersOptions
> > > > >
> > > > > Also, it seems AbortTransactionOptions would just be empty? If yes
> do
> > > we
> > > > > really need this option class for now?
> > > > >
> > > > > 7. A couple questions from the cmd tool examples:
> > > > > 7.1 Is "--broker" a required or optional (in that case I presume we
> > > would
> > > > > just query all brokers iteratively) in "--find-hanging"?
> > > > > 7.2 Seems "list-producers" is not exposed as a standalone feature
> in
> > > the
> > > > > cmd but only used in the wrapping "--find-hanging", is that
> > > intentional?
> > > > > Personally I feel exposing a "--list-producers" may be useful too:
> if
> > > we
> > > > > believe the user has the right ACL, it is legitimate to return the
> > > > producer
> > > > > information to her anyways. But that is debatable in the meta point
> > 3)
> > > > > above.
> > > > > 7.3 "Describing Transactions": we should also explain how that
> would
> > be
> > > > > executed, e.g. at least we should clarify that we would first find
> > the
> > > > > coordinator based on the transactional.id and hence users do not
> > need
> > > to
> > > > > specify one.
> > > > > 7.4. In "Aborting Transactions", should we also specify the
> > "--broker"
> > > > node
> > > > > as a required option? Otherwise we would not know which broker to
> > send
> > > > to.
> > > > >
> > > > >
> > > > > Overall, nice written one, thanks Jason.
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> > lucas@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > >> Would it be worth returning transactional.id.expiration.ms in
> > the
> > > > > > DescribeProducersResponse?
> > > > > >
> > > > > > > That's an interesting thought as well. Are you trying to avoid
> > the
> > > > need
> > > > > > to
> > > > > > specify it through the command line? The tool could also query
> the
> > > > value
> > > > > > with DescribeConfigs I suppose.
> > > > > >
> > > > > > Basically. I'm not sure how useful this will be in practice,
> though
> > > it
> > > > > > might help when debugging.
> > > > > >
> > > > > > Lucas
> > > > > >
> > > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> > jason@confluent.io
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Lucas,
> > > > > > >
> > > > > > > Thanks for the comments. Responses below:
> > > > > > >
> > > > > > > > Given that it's possible for replica producer states to
> diverge
> > > > from
> > > > > > each
> > > > > > > other, it would be very useful if
> > > DescribeProducers(Request,Response)
> > > > > and
> > > > > > > tooling is able to query all partition replicas for their
> > producers
> > > > > > >
> > > > > > > Yes, it makes sense to me to let DescribeProducers work on both
> > > > > followers
> > > > > > > and leaders. In fact, I'm encouraged that there are use cases
> for
> > > > this
> > > > > > work
> > > > > > > other than detecting hanging transactions. That was indeed the
> > > hope,
> > > > > but
> > > > > > I
> > > > > > > didn't have anything specific in mind. I will update the
> > proposal.
> > > > > > >
> > > > > > > > Would it be worth returning transactional.id.expiration.ms
> in
> > > the
> > > > > > > DescribeProducersResponse?
> > > > > > >
> > > > > > > That's an interesting thought as well. Are you trying to avoid
> > the
> > > > need
> > > > > > to
> > > > > > > specify it through the command line? The tool could also query
> > the
> > > > > value
> > > > > > > with DescribeConfigs I suppose.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > > lucas@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jason,
> > > > > > > >
> > > > > > > > This looks like a very useful tool, thanks for writing it up.
> > > > > > > >
> > > > > > > > Given that it's possible for replica producer states to
> diverge
> > > > from
> > > > > > each
> > > > > > > > other, it would be very useful if
> > > > DescribeProducers(Request,Response)
> > > > > > and
> > > > > > > > tooling is able to query all partition replicas for their
> > > > producers.
> > > > > > One
> > > > > > > > way I can see this being used immediately is in kafka's
> system
> > > > tests,
> > > > > > > > especially the ones that inject failures. At the end of the
> > test
> > > we
> > > > > can
> > > > > > > > query all replicas and make sure that their states have not
> > > > > diverged. I
> > > > > > > can
> > > > > > > > also see it being useful when debugging production clusters
> > too.
> > > > > > > >
> > > > > > > > Would it be worth returning transactional.id.expiration.ms
> in
> > > the
> > > > > > > > DescribeProducersResponse?
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > >
> > > > > > > > Lucas
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > > rndgstn@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > > > > > >
> > > > > > > > > Ron
> > > > > > > > >
> > > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > > jason@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hey Ron,
> > > > > > > > > >
> > > > > > > > > > We do not typically backport new APIs to older versions.
> I
> > > > think
> > > > > we
> > > > > > > can
> > > > > > > > > > however make the --abort command compatible with older
> > > > versions.
> > > > > It
> > > > > > > > would
> > > > > > > > > > require a user to do some analysis on their own to
> > identify a
> > > > > > hanging
> > > > > > > > > > transaction, but then they can use the tool from a new
> > > release
> > > > to
> > > > > > > > > recover.
> > > > > > > > > > For example, users could detect a hanging transaction
> > through
> > > > the
> > > > > > > > > existing
> > > > > > > > > > "LastStableOffsetLag" metric and then collect the needed
> > > > > > information
> > > > > > > > > from a
> > > > > > > > > > dump of the log (or producer snapshot). It's more work,
> but
> > > at
> > > > > > least
> > > > > > > > it's
> > > > > > > > > > possible. Does that sound fair?
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Jason
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > > rndgstn@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > > > > > >
> > > > > > > > > > > Will the implementation be backported to prior Kafka
> > > > versions?
> > > > > > The
> > > > > > > > > > reason
> > > > > > > > > > > I ask is because if it is not backported and similar
> > > > > > functionality
> > > > > > > is
> > > > > > > > > not
> > > > > > > > > > > otherwise made available for older versions, then the
> > only
> > > > > > recourse
> > > > > > > > > > (aside
> > > > > > > > > > > from deleting and recreating the topic as you pointed
> > out)
> > > > may
> > > > > be
> > > > > > > to
> > > > > > > > > > > upgrade to 2.7 (or whatever version ends up getting
> this
> > > > > > > > > functionality).
> > > > > > > > > > > Such an upgrade may not be desirable, especially if the
> > > > number
> > > > > of
> > > > > > > > > > > intermediate versions is considerable. I understand the
> > > > mantra
> > > > > of
> > > > > > > > > "never
> > > > > > > > > > > fall too many versions behind" but the reality of it is
> > > that
> > > > it
> > > > > > > isn't
> > > > > > > > > > > always the case.  Even if the version is relatively
> > recent,
> > > > an
> > > > > > > > upgrade
> > > > > > > > > > may
> > > > > > > > > > > still not be possible for some time, and a quicker
> > > resolution
> > > > > may
> > > > > > > be
> > > > > > > > > > > necessary.
> > > > > > > > > > >
> > > > > > > > > > > Ron
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > > > > jason@confluent.io>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi All,
> > > > > > > > > > > >
> > > > > > > > > > > > I've added a proposal to handle the problem of
> hanging
> > > > > > > > transactions:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > > .
> > > > > > > > > > > > In theory, this should never happen. In practice, we
> > have
> > > > hit
> > > > > > one
> > > > > > > > bug
> > > > > > > > > > > where
> > > > > > > > > > > > it was possible and there are few good options today
> to
> > > > > > recover.
> > > > > > > > > Take a
> > > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jason
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -- Guozhang
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
> Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
response either?

Oh, I accidentally called it `CurrentTxnStartTimestamp` in the schema.
Fixed now!

-Jason

On Mon, Aug 31, 2020 at 3:04 PM Guozhang Wang <wa...@gmail.com> wrote:

> On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Guozhang,
> >
> > Thanks for the detailed comments. Responses inline:
> >
> > > 1. I'd like to clarify how we can make "--abort" work with old brokers,
> > since without the additional field "Partitions" the tool needs to set the
> > coordinator epoch correctly instead of "-1"? Arguably that's still doable
> > but would require different call paths, and it's not clear whether that's
> > worth doing for old versions.
> >
> > That's a good question. What I had in mind was to write the marker using
> > the last coordinator epoch that was used by the respective ProducerId. I
> > realized that I left the coordinator epoch out of the `DescribeProducers`
> > response, so I have updated the KIP to include it. It is possible that
> > there is no coordinator epoch associated with a given ProducerId (e.g. if
> > it is the first transaction from that producer), but in this case we can
> > use 0.
> >
> > As for whether this is worth doing, I guess I would be more inclined to
> > leave it out if users had a reasonable alternative today to address this
> > problem.
> >
> > > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> > leaders
> > while ListTransactions can be sent to any brokers? Or is it really
> > "ListTransactions to be sent to coordinators only"? From the workflow
> > you've described, based on the results back from DescribeProducers, we
> > should just immediately send ListTransactions to the
> > corresponding coordinators based on the collected producer ids, instead
> of
> > trying to send to any brokers right?
> >
> > I'm going to change `DescribeProducers` so that it can be handled by any
> > replica of a topic partition. This was suggested by Lucas in order to
> allow
> > this API to be used for replica consistency testing. As far as
> > `ListTransactions`, I was treating this similarly to `ListGroups`.
> Although
> > we know that the coordinators are the leaders of the __transaction_state
> > partitions, this is more of an implementation detail. From an API
> > perspective, we say that any broker could be a transaction coordinator.
> >
> > > 3. One thing I'm a bit hesitant about is that, is `Describe` permission
> > on
> > the associated topic sufficient to allow any users to get all producer
> > information writing to the specific topic-partitions including last
> > timestamp, txn-start-timestamp etc, which may be considered sensitive?
> > Should we require "ClusterAction" to only allow operators only?
> >
> > That's a fair point. Do you think `Read` permission would be reasonable?
> > This is all information that could be obtained by reading the topic.
> >
> > Yeah that makes sense.
>
>
> > > 4. From the example it seems "TxnStartOffset" should be included in the
> > DescribeTransaction response schema? Otherwise the user would not get it
> in
> > the following WriteTxnMarker request.
> >
> > The `DescribeTransaction` API is sent to the transaction coordinator,
> which
> > does not know the start offset of a transaction in each topic partition.
> > That is why we need `DescribeProducers`.
> >
>
> Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
> response either?
>
>
> >
> > > 5. It is a bit easier for readers to highlight the added fields in the
> > existing WriteTxnMarkerRequest (btw I read is that we are only adding
> > "Partitions" with the starting offset, right?). Also as for its response
> it
> > seems we do not make any schema changes except adding one more potential
> > error code "INVALID_TXN_STATE" to it, right? If that's the case we can
> just
> > state that explicitly.
> >
> > I highlighted the new field in the request. For the response, the KIP
> > states the following: "There are no changes to the response schema, but
> it
> > will be bumped. Note that we are also enabling flexible version support."
> >
> > > 6. It is not clear to me for the overloaded function that the following
> > option classes are not specified, what should be the default options?
> > ...
> >
> > I was just trying to stick with existing conventions, but I will add some
> > more detail here. I think we should probably still include
> > `AbortTransactionOptions`. The `Options` classes are how users override
> > timeouts.
> >
> > > 7.1 Is "--broker" a required or optional (in that case I presume we
> would
> > just query all brokers iteratively) in "--find-hanging"?
> >
> > I think it should be required as a reasonable way to limit the scope of
> the
> > search. This is meant to be guided by metrics after all. If we do not
> limit
> > the scope to a single broker, then the behavior might get worse as the
> > cluster grows. I will clarify this.
> >
> > > 7.2 Seems "list-producers" is not exposed as a standalone feature in
> the
> > cmd but only used in the wrapping "--find-hanging", is that intentional?
> > Personally I feel exposing a "--list-producers" may be useful too: if we
> > believe the user has the right ACL, it is legitimate to return the
> producer
> > information to her anyways. But that is debatable in the meta point 3)
> > above.
> >
> > Yeah, I was planning to add this to support the use case that Lucas
> > mentioned. There is some awkwardness since it is a little difficult to
> > convey different sources of information through the same command. I guess
> > we can do `--list producers` and `--list transactions` and explain in the
> > documentation. Maybe that is good enough.
> >
> > > 7.3 "Describing Transactions": we should also explain how that would be
> > executed, e.g. at least we should clarify that we would first find the
> > coordinator based on the transactional.id and hence users do not need to
> > specify one.
> >
> > Sure, makes sense.
> >
> > > 7.4. In "Aborting Transactions", should we also specify the "--broker"
> > node
> > as a required option? Otherwise we would not know which broker to send
> to.
> >
> > The --topic and --partition arguments are required, so the target is
> always
> > the leader of that partition.
> >
> >
> > Thanks,
> > Jason
> >
> >
> >
> > On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <bob.barrett@confluent.io
> >
> > wrote:
> >
> > > Hi Jason,
> > >
> > > Thanks for this KIP, I think this will be a huge operational
> improvement
> > > and overall it looks great to me.
> > >
> > > I'm not sure how much value the MaxActiveTransactionDuration metric
> adds,
> > > given that we have the --find-hanging option in the tool. As you
> mention,
> > > instances of these transactions are expected to be rare, and a
> > > partition-level metric, which can generate a lot of data, seems very
> > > heavyweight for such a rare occurrence. I think "alert on
> > > PartitionsWithLateTransactionsCount" followed by "run
> kafka-transactions
> > > --find-hanging on the relevant broker" is a reasonable process for
> > cluster
> > > operators to follow.
> > >
> > > Thanks,
> > > Bob
> > >
> > > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > Thanks for the written KIP. I think this is going to be a very useful
> > > tool
> > > > for operational improvements since with eos in its current stage, we
> > > cannot
> > > > confidently assert that we are bug-free, and even in the future when
> we
> > > are
> > > > confident this is still going to be leveraged by older versioned
> > brokers.
> > > > Regarding the solution, I've also debated myself whether Kafka should
> > > > "self-heal" automatically when detected in such situations, or should
> > we
> > > > instead build into ecosystem tooling to let operators do it. And I've
> > > also
> > > > convinced myself that the latter should be a better solution to keep
> > > Kafka
> > > > software itself simpler.
> > > >
> > > > Regarding the KIP itself, I have a few meta comments below:
> > > >
> > > > 1. I'd like to clarify how we can make "--abort" work with old
> brokers,
> > > > since without the additional field "Partitions" the tool needs to set
> > the
> > > > coordinator epoch correctly instead of "-1"? Arguably that's still
> > doable
> > > > but would require different call paths, and it's not clear whether
> > that's
> > > > worth doing for old versions.
> > > >
> > > > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> > > leaders
> > > > while ListTransactions can be sent to any brokers? Or is it really
> > > > "ListTransactions to be sent to coordinators only"? From the workflow
> > > > you've described, based on the results back from DescribeProducers,
> we
> > > > should just immediately send ListTransactions to the
> > > > corresponding coordinators based on the collected producer ids,
> instead
> > > of
> > > > trying to send to any brokers right?
> > > >
> > > > Also I'm a bit concerned if "ListTransactions" could potentially
> return
> > > too
> > > > much data with "StateFilters" set to all states, including completed
> > > ones.
> > > > Do we expect users ever want to know transactions that are not
> pending?
> > > On
> > > > the other hand, maybe we can just require users to specify the
> "pids[]"
> > > in
> > > > this request too to further filter those un-interested transactions.
> > This
> > > > also works well with the workflow: we know exactly from
> > > "DescribeProducers"
> > > > which pids are we diagnosing right now, so in the follow-up
> > > > "ListTransactions" we should also only care for those partitions
> only.
> > > >
> > > > 3. One thing I'm a bit hesitant about is that, is `Describe`
> permission
> > > on
> > > > the associated topic sufficient to allow any users to get all
> producer
> > > > information writing to the specific topic-partitions including last
> > > > timestamp, txn-start-timestamp etc, which may be considered
> sensitive?
> > > > Should we require "ClusterAction" to only allow operators only?
> > > >
> > > > Below are more detailed comments:
> > > >
> > > > 4. From the example it seems "TxnStartOffset" should be included in
> the
> > > > DescribeTransaction response schema? Otherwise the user would not get
> > it
> > > in
> > > > the following WriteTxnMarker request.
> > > >
> > > > 5. It is a bit easier for readers to highlight the added fields in
> the
> > > > existing WriteTxnMarkerRequest (btw I read is that we are only adding
> > > > "Partitions" with the starting offset, right?). Also as for its
> > response
> > > it
> > > > seems we do not make any schema changes except adding one more
> > potential
> > > > error code "INVALID_TXN_STATE" to it, right? If that's the case we
> can
> > > just
> > > > state that explicitly.
> > > >
> > > > 6. It is not clear to me for the overloaded function that the
> following
> > > > option classes are not specified, what should be the default options?
> > > >
> > > > * ListTransactionsOptions
> > > > * DescribeTransactionsOptions
> > > > * DescribeProducersOptions
> > > >
> > > > Also, it seems AbortTransactionOptions would just be empty? If yes do
> > we
> > > > really need this option class for now?
> > > >
> > > > 7. A couple questions from the cmd tool examples:
> > > > 7.1 Is "--broker" a required or optional (in that case I presume we
> > would
> > > > just query all brokers iteratively) in "--find-hanging"?
> > > > 7.2 Seems "list-producers" is not exposed as a standalone feature in
> > the
> > > > cmd but only used in the wrapping "--find-hanging", is that
> > intentional?
> > > > Personally I feel exposing a "--list-producers" may be useful too: if
> > we
> > > > believe the user has the right ACL, it is legitimate to return the
> > > producer
> > > > information to her anyways. But that is debatable in the meta point
> 3)
> > > > above.
> > > > 7.3 "Describing Transactions": we should also explain how that would
> be
> > > > executed, e.g. at least we should clarify that we would first find
> the
> > > > coordinator based on the transactional.id and hence users do not
> need
> > to
> > > > specify one.
> > > > 7.4. In "Aborting Transactions", should we also specify the
> "--broker"
> > > node
> > > > as a required option? Otherwise we would not know which broker to
> send
> > > to.
> > > >
> > > >
> > > > Overall, nice written one, thanks Jason.
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <
> lucas@confluent.io>
> > > > wrote:
> > > >
> > > > > >> Would it be worth returning transactional.id.expiration.ms in
> the
> > > > > DescribeProducersResponse?
> > > > >
> > > > > > That's an interesting thought as well. Are you trying to avoid
> the
> > > need
> > > > > to
> > > > > specify it through the command line? The tool could also query the
> > > value
> > > > > with DescribeConfigs I suppose.
> > > > >
> > > > > Basically. I'm not sure how useful this will be in practice, though
> > it
> > > > > might help when debugging.
> > > > >
> > > > > Lucas
> > > > >
> > > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <
> jason@confluent.io
> > >
> > > > > wrote:
> > > > >
> > > > > > Hey Lucas,
> > > > > >
> > > > > > Thanks for the comments. Responses below:
> > > > > >
> > > > > > > Given that it's possible for replica producer states to diverge
> > > from
> > > > > each
> > > > > > other, it would be very useful if
> > DescribeProducers(Request,Response)
> > > > and
> > > > > > tooling is able to query all partition replicas for their
> producers
> > > > > >
> > > > > > Yes, it makes sense to me to let DescribeProducers work on both
> > > > followers
> > > > > > and leaders. In fact, I'm encouraged that there are use cases for
> > > this
> > > > > work
> > > > > > other than detecting hanging transactions. That was indeed the
> > hope,
> > > > but
> > > > > I
> > > > > > didn't have anything specific in mind. I will update the
> proposal.
> > > > > >
> > > > > > > Would it be worth returning transactional.id.expiration.ms in
> > the
> > > > > > DescribeProducersResponse?
> > > > > >
> > > > > > That's an interesting thought as well. Are you trying to avoid
> the
> > > need
> > > > > to
> > > > > > specify it through the command line? The tool could also query
> the
> > > > value
> > > > > > with DescribeConfigs I suppose.
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > > lucas@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jason,
> > > > > > >
> > > > > > > This looks like a very useful tool, thanks for writing it up.
> > > > > > >
> > > > > > > Given that it's possible for replica producer states to diverge
> > > from
> > > > > each
> > > > > > > other, it would be very useful if
> > > DescribeProducers(Request,Response)
> > > > > and
> > > > > > > tooling is able to query all partition replicas for their
> > > producers.
> > > > > One
> > > > > > > way I can see this being used immediately is in kafka's system
> > > tests,
> > > > > > > especially the ones that inject failures. At the end of the
> test
> > we
> > > > can
> > > > > > > query all replicas and make sure that their states have not
> > > > diverged. I
> > > > > > can
> > > > > > > also see it being useful when debugging production clusters
> too.
> > > > > > >
> > > > > > > Would it be worth returning transactional.id.expiration.ms in
> > the
> > > > > > > DescribeProducersResponse?
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Lucas
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> > rndgstn@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > > > > >
> > > > > > > > Ron
> > > > > > > >
> > > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > > jason@confluent.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hey Ron,
> > > > > > > > >
> > > > > > > > > We do not typically backport new APIs to older versions. I
> > > think
> > > > we
> > > > > > can
> > > > > > > > > however make the --abort command compatible with older
> > > versions.
> > > > It
> > > > > > > would
> > > > > > > > > require a user to do some analysis on their own to
> identify a
> > > > > hanging
> > > > > > > > > transaction, but then they can use the tool from a new
> > release
> > > to
> > > > > > > > recover.
> > > > > > > > > For example, users could detect a hanging transaction
> through
> > > the
> > > > > > > > existing
> > > > > > > > > "LastStableOffsetLag" metric and then collect the needed
> > > > > information
> > > > > > > > from a
> > > > > > > > > dump of the log (or producer snapshot). It's more work, but
> > at
> > > > > least
> > > > > > > it's
> > > > > > > > > possible. Does that sound fair?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jason
> > > > > > > > >
> > > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > > rndgstn@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > > > > >
> > > > > > > > > > Will the implementation be backported to prior Kafka
> > > versions?
> > > > > The
> > > > > > > > > reason
> > > > > > > > > > I ask is because if it is not backported and similar
> > > > > functionality
> > > > > > is
> > > > > > > > not
> > > > > > > > > > otherwise made available for older versions, then the
> only
> > > > > recourse
> > > > > > > > > (aside
> > > > > > > > > > from deleting and recreating the topic as you pointed
> out)
> > > may
> > > > be
> > > > > > to
> > > > > > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > > > > > functionality).
> > > > > > > > > > Such an upgrade may not be desirable, especially if the
> > > number
> > > > of
> > > > > > > > > > intermediate versions is considerable. I understand the
> > > mantra
> > > > of
> > > > > > > > "never
> > > > > > > > > > fall too many versions behind" but the reality of it is
> > that
> > > it
> > > > > > isn't
> > > > > > > > > > always the case.  Even if the version is relatively
> recent,
> > > an
> > > > > > > upgrade
> > > > > > > > > may
> > > > > > > > > > still not be possible for some time, and a quicker
> > resolution
> > > > may
> > > > > > be
> > > > > > > > > > necessary.
> > > > > > > > > >
> > > > > > > > > > Ron
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > > > jason@confluent.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > I've added a proposal to handle the problem of hanging
> > > > > > > transactions:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > > .
> > > > > > > > > > > In theory, this should never happen. In practice, we
> have
> > > hit
> > > > > one
> > > > > > > bug
> > > > > > > > > > where
> > > > > > > > > > > it was possible and there are few good options today to
> > > > > recover.
> > > > > > > > Take a
> > > > > > > > > > > look and let me know what you think.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Guozhang Wang <wa...@gmail.com>.
On Mon, Aug 31, 2020 at 12:28 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Guozhang,
>
> Thanks for the detailed comments. Responses inline:
>
> > 1. I'd like to clarify how we can make "--abort" work with old brokers,
> since without the additional field "Partitions" the tool needs to set the
> coordinator epoch correctly instead of "-1"? Arguably that's still doable
> but would require different call paths, and it's not clear whether that's
> worth doing for old versions.
>
> That's a good question. What I had in mind was to write the marker using
> the last coordinator epoch that was used by the respective ProducerId. I
> realized that I left the coordinator epoch out of the `DescribeProducers`
> response, so I have updated the KIP to include it. It is possible that
> there is no coordinator epoch associated with a given ProducerId (e.g. if
> it is the first transaction from that producer), but in this case we can
> use 0.
>
> As for whether this is worth doing, I guess I would be more inclined to
> leave it out if users had a reasonable alternative today to address this
> problem.
>
> > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> leaders
> while ListTransactions can be sent to any brokers? Or is it really
> "ListTransactions to be sent to coordinators only"? From the workflow
> you've described, based on the results back from DescribeProducers, we
> should just immediately send ListTransactions to the
> corresponding coordinators based on the collected producer ids, instead of
> trying to send to any brokers right?
>
> I'm going to change `DescribeProducers` so that it can be handled by any
> replica of a topic partition. This was suggested by Lucas in order to allow
> this API to be used for replica consistency testing. As far as
> `ListTransactions`, I was treating this similarly to `ListGroups`. Although
> we know that the coordinators are the leaders of the __transaction_state
> partitions, this is more of an implementation detail. From an API
> perspective, we say that any broker could be a transaction coordinator.
>
> > 3. One thing I'm a bit hesitant about is that, is `Describe` permission
> on
> the associated topic sufficient to allow any users to get all producer
> information writing to the specific topic-partitions including last
> timestamp, txn-start-timestamp etc, which may be considered sensitive?
> Should we require "ClusterAction" to only allow operators only?
>
> That's a fair point. Do you think `Read` permission would be reasonable?
> This is all information that could be obtained by reading the topic.
>
> Yeah that makes sense.


> > 4. From the example it seems "TxnStartOffset" should be included in the
> DescribeTransaction response schema? Otherwise the user would not get it in
> the following WriteTxnMarker request.
>
> The `DescribeTransaction` API is sent to the transaction coordinator, which
> does not know the start offset of a transaction in each topic partition.
> That is why we need `DescribeProducers`.
>

Hmm, but the "TxnStartOffset" is not included in the DescribeProducers
response either?


>
> > 5. It is a bit easier for readers to highlight the added fields in the
> existing WriteTxnMarkerRequest (btw I read is that we are only adding
> "Partitions" with the starting offset, right?). Also as for its response it
> seems we do not make any schema changes except adding one more potential
> error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
> state that explicitly.
>
> I highlighted the new field in the request. For the response, the KIP
> states the following: "There are no changes to the response schema, but it
> will be bumped. Note that we are also enabling flexible version support."
>
> > 6. It is not clear to me for the overloaded function that the following
> option classes are not specified, what should be the default options?
> ...
>
> I was just trying to stick with existing conventions, but I will add some
> more detail here. I think we should probably still include
> `AbortTransactionOptions`. The `Options` classes are how users override
> timeouts.
>
> > 7.1 Is "--broker" a required or optional (in that case I presume we would
> just query all brokers iteratively) in "--find-hanging"?
>
> I think it should be required as a reasonable way to limit the scope of the
> search. This is meant to be guided by metrics after all. If we do not limit
> the scope to a single broker, then the behavior might get worse as the
> cluster grows. I will clarify this.
>
> > 7.2 Seems "list-producers" is not exposed as a standalone feature in the
> cmd but only used in the wrapping "--find-hanging", is that intentional?
> Personally I feel exposing a "--list-producers" may be useful too: if we
> believe the user has the right ACL, it is legitimate to return the producer
> information to her anyways. But that is debatable in the meta point 3)
> above.
>
> Yeah, I was planning to add this to support the use case that Lucas
> mentioned. There is some awkwardness since it is a little difficult to
> convey different sources of information through the same command. I guess
> we can do `--list producers` and `--list transactions` and explain in the
> documentation. Maybe that is good enough.
>
> > 7.3 "Describing Transactions": we should also explain how that would be
> executed, e.g. at least we should clarify that we would first find the
> coordinator based on the transactional.id and hence users do not need to
> specify one.
>
> Sure, makes sense.
>
> > 7.4. In "Aborting Transactions", should we also specify the "--broker"
> node
> as a required option? Otherwise we would not know which broker to send to.
>
> The --topic and --partition arguments are required, so the target is always
> the leader of that partition.
>
>
> Thanks,
> Jason
>
>
>
> On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <bo...@confluent.io>
> wrote:
>
> > Hi Jason,
> >
> > Thanks for this KIP, I think this will be a huge operational improvement
> > and overall it looks great to me.
> >
> > I'm not sure how much value the MaxActiveTransactionDuration metric adds,
> > given that we have the --find-hanging option in the tool. As you mention,
> > instances of these transactions are expected to be rare, and a
> > partition-level metric, which can generate a lot of data, seems very
> > heavyweight for such a rare occurrence. I think "alert on
> > PartitionsWithLateTransactionsCount" followed by "run kafka-transactions
> > --find-hanging on the relevant broker" is a reasonable process for
> cluster
> > operators to follow.
> >
> > Thanks,
> > Bob
> >
> > On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Hi Jason,
> > >
> > > Thanks for the written KIP. I think this is going to be a very useful
> > tool
> > > for operational improvements since with eos in its current stage, we
> > cannot
> > > confidently assert that we are bug-free, and even in the future when we
> > are
> > > confident this is still going to be leveraged by older versioned
> brokers.
> > > Regarding the solution, I've also debated myself whether Kafka should
> > > "self-heal" automatically when detected in such situations, or should
> we
> > > instead build into ecosystem tooling to let operators do it. And I've
> > also
> > > convinced myself that the latter should be a better solution to keep
> > Kafka
> > > software itself simpler.
> > >
> > > Regarding the KIP itself, I have a few meta comments below:
> > >
> > > 1. I'd like to clarify how we can make "--abort" work with old brokers,
> > > since without the additional field "Partitions" the tool needs to set
> the
> > > coordinator epoch correctly instead of "-1"? Arguably that's still
> doable
> > > but would require different call paths, and it's not clear whether
> that's
> > > worth doing for old versions.
> > >
> > > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> > leaders
> > > while ListTransactions can be sent to any brokers? Or is it really
> > > "ListTransactions to be sent to coordinators only"? From the workflow
> > > you've described, based on the results back from DescribeProducers, we
> > > should just immediately send ListTransactions to the
> > > corresponding coordinators based on the collected producer ids, instead
> > of
> > > trying to send to any brokers right?
> > >
> > > Also I'm a bit concerned if "ListTransactions" could potentially return
> > too
> > > much data with "StateFilters" set to all states, including completed
> > ones.
> > > Do we expect users ever want to know transactions that are not pending?
> > On
> > > the other hand, maybe we can just require users to specify the "pids[]"
> > in
> > > this request too to further filter those un-interested transactions.
> This
> > > also works well with the workflow: we know exactly from
> > "DescribeProducers"
> > > which pids are we diagnosing right now, so in the follow-up
> > > "ListTransactions" we should also only care for those partitions only.
> > >
> > > 3. One thing I'm a bit hesitant about is that, is `Describe` permission
> > on
> > > the associated topic sufficient to allow any users to get all producer
> > > information writing to the specific topic-partitions including last
> > > timestamp, txn-start-timestamp etc, which may be considered sensitive?
> > > Should we require "ClusterAction" to only allow operators only?
> > >
> > > Below are more detailed comments:
> > >
> > > 4. From the example it seems "TxnStartOffset" should be included in the
> > > DescribeTransaction response schema? Otherwise the user would not get
> it
> > in
> > > the following WriteTxnMarker request.
> > >
> > > 5. It is a bit easier for readers to highlight the added fields in the
> > > existing WriteTxnMarkerRequest (btw I read is that we are only adding
> > > "Partitions" with the starting offset, right?). Also as for its
> response
> > it
> > > seems we do not make any schema changes except adding one more
> potential
> > > error code "INVALID_TXN_STATE" to it, right? If that's the case we can
> > just
> > > state that explicitly.
> > >
> > > 6. It is not clear to me for the overloaded function that the following
> > > option classes are not specified, what should be the default options?
> > >
> > > * ListTransactionsOptions
> > > * DescribeTransactionsOptions
> > > * DescribeProducersOptions
> > >
> > > Also, it seems AbortTransactionOptions would just be empty? If yes do
> we
> > > really need this option class for now?
> > >
> > > 7. A couple questions from the cmd tool examples:
> > > 7.1 Is "--broker" a required or optional (in that case I presume we
> would
> > > just query all brokers iteratively) in "--find-hanging"?
> > > 7.2 Seems "list-producers" is not exposed as a standalone feature in
> the
> > > cmd but only used in the wrapping "--find-hanging", is that
> intentional?
> > > Personally I feel exposing a "--list-producers" may be useful too: if
> we
> > > believe the user has the right ACL, it is legitimate to return the
> > producer
> > > information to her anyways. But that is debatable in the meta point 3)
> > > above.
> > > 7.3 "Describing Transactions": we should also explain how that would be
> > > executed, e.g. at least we should clarify that we would first find the
> > > coordinator based on the transactional.id and hence users do not need
> to
> > > specify one.
> > > 7.4. In "Aborting Transactions", should we also specify the "--broker"
> > node
> > > as a required option? Otherwise we would not know which broker to send
> > to.
> > >
> > >
> > > Overall, nice written one, thanks Jason.
> > >
> > > Guozhang
> > >
> > >
> > > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
> > > wrote:
> > >
> > > > >> Would it be worth returning transactional.id.expiration.ms in the
> > > > DescribeProducersResponse?
> > > >
> > > > > That's an interesting thought as well. Are you trying to avoid the
> > need
> > > > to
> > > > specify it through the command line? The tool could also query the
> > value
> > > > with DescribeConfigs I suppose.
> > > >
> > > > Basically. I'm not sure how useful this will be in practice, though
> it
> > > > might help when debugging.
> > > >
> > > > Lucas
> > > >
> > > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <jason@confluent.io
> >
> > > > wrote:
> > > >
> > > > > Hey Lucas,
> > > > >
> > > > > Thanks for the comments. Responses below:
> > > > >
> > > > > > Given that it's possible for replica producer states to diverge
> > from
> > > > each
> > > > > other, it would be very useful if
> DescribeProducers(Request,Response)
> > > and
> > > > > tooling is able to query all partition replicas for their producers
> > > > >
> > > > > Yes, it makes sense to me to let DescribeProducers work on both
> > > followers
> > > > > and leaders. In fact, I'm encouraged that there are use cases for
> > this
> > > > work
> > > > > other than detecting hanging transactions. That was indeed the
> hope,
> > > but
> > > > I
> > > > > didn't have anything specific in mind. I will update the proposal.
> > > > >
> > > > > > Would it be worth returning transactional.id.expiration.ms in
> the
> > > > > DescribeProducersResponse?
> > > > >
> > > > > That's an interesting thought as well. Are you trying to avoid the
> > need
> > > > to
> > > > > specify it through the command line? The tool could also query the
> > > value
> > > > > with DescribeConfigs I suppose.
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> > lucas@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi Jason,
> > > > > >
> > > > > > This looks like a very useful tool, thanks for writing it up.
> > > > > >
> > > > > > Given that it's possible for replica producer states to diverge
> > from
> > > > each
> > > > > > other, it would be very useful if
> > DescribeProducers(Request,Response)
> > > > and
> > > > > > tooling is able to query all partition replicas for their
> > producers.
> > > > One
> > > > > > way I can see this being used immediately is in kafka's system
> > tests,
> > > > > > especially the ones that inject failures. At the end of the test
> we
> > > can
> > > > > > query all replicas and make sure that their states have not
> > > diverged. I
> > > > > can
> > > > > > also see it being useful when debugging production clusters too.
> > > > > >
> > > > > > Would it be worth returning transactional.id.expiration.ms in
> the
> > > > > > DescribeProducersResponse?
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Lucas
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <
> rndgstn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > > > >
> > > > > > > Ron
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > > jason@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hey Ron,
> > > > > > > >
> > > > > > > > We do not typically backport new APIs to older versions. I
> > think
> > > we
> > > > > can
> > > > > > > > however make the --abort command compatible with older
> > versions.
> > > It
> > > > > > would
> > > > > > > > require a user to do some analysis on their own to identify a
> > > > hanging
> > > > > > > > transaction, but then they can use the tool from a new
> release
> > to
> > > > > > > recover.
> > > > > > > > For example, users could detect a hanging transaction through
> > the
> > > > > > > existing
> > > > > > > > "LastStableOffsetLag" metric and then collect the needed
> > > > information
> > > > > > > from a
> > > > > > > > dump of the log (or producer snapshot). It's more work, but
> at
> > > > least
> > > > > > it's
> > > > > > > > possible. Does that sound fair?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > > rndgstn@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > > > >
> > > > > > > > > Will the implementation be backported to prior Kafka
> > versions?
> > > > The
> > > > > > > > reason
> > > > > > > > > I ask is because if it is not backported and similar
> > > > functionality
> > > > > is
> > > > > > > not
> > > > > > > > > otherwise made available for older versions, then the only
> > > > recourse
> > > > > > > > (aside
> > > > > > > > > from deleting and recreating the topic as you pointed out)
> > may
> > > be
> > > > > to
> > > > > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > > > > functionality).
> > > > > > > > > Such an upgrade may not be desirable, especially if the
> > number
> > > of
> > > > > > > > > intermediate versions is considerable. I understand the
> > mantra
> > > of
> > > > > > > "never
> > > > > > > > > fall too many versions behind" but the reality of it is
> that
> > it
> > > > > isn't
> > > > > > > > > always the case.  Even if the version is relatively recent,
> > an
> > > > > > upgrade
> > > > > > > > may
> > > > > > > > > still not be possible for some time, and a quicker
> resolution
> > > may
> > > > > be
> > > > > > > > > necessary.
> > > > > > > > >
> > > > > > > > > Ron
> > > > > > > > >
> > > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > > jason@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > I've added a proposal to handle the problem of hanging
> > > > > > transactions:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > > .
> > > > > > > > > > In theory, this should never happen. In practice, we have
> > hit
> > > > one
> > > > > > bug
> > > > > > > > > where
> > > > > > > > > > it was possible and there are few good options today to
> > > > recover.
> > > > > > > Take a
> > > > > > > > > > look and let me know what you think.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Jason
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Guozhang,

Thanks for the detailed comments. Responses inline:

> 1. I'd like to clarify how we can make "--abort" work with old brokers,
since without the additional field "Partitions" the tool needs to set the
coordinator epoch correctly instead of "-1"? Arguably that's still doable
but would require different call paths, and it's not clear whether that's
worth doing for old versions.

That's a good question. What I had in mind was to write the marker using
the last coordinator epoch that was used by the respective ProducerId. I
realized that I left the coordinator epoch out of the `DescribeProducers`
response, so I have updated the KIP to include it. It is possible that
there is no coordinator epoch associated with a given ProducerId (e.g. if
it is the first transaction from that producer), but in this case we can
use 0.

As for whether this is worth doing, I guess I would be more inclined to
leave it out if users had a reasonable alternative today to address this
problem.

> 2. Why do we have to enforce "DescribeProducers" to be sent to only
leaders
while ListTransactions can be sent to any brokers? Or is it really
"ListTransactions to be sent to coordinators only"? From the workflow
you've described, based on the results back from DescribeProducers, we
should just immediately send ListTransactions to the
corresponding coordinators based on the collected producer ids, instead of
trying to send to any brokers right?

I'm going to change `DescribeProducers` so that it can be handled by any
replica of a topic partition. This was suggested by Lucas in order to allow
this API to be used for replica consistency testing. As far as
`ListTransactions`, I was treating this similarly to `ListGroups`. Although
we know that the coordinators are the leaders of the __transaction_state
partitions, this is more of an implementation detail. From an API
perspective, we say that any broker could be a transaction coordinator.

> 3. One thing I'm a bit hesitant about is that, is `Describe` permission on
the associated topic sufficient to allow any users to get all producer
information writing to the specific topic-partitions including last
timestamp, txn-start-timestamp etc, which may be considered sensitive?
Should we require "ClusterAction" to only allow operators only?

That's a fair point. Do you think `Read` permission would be reasonable?
This is all information that could be obtained by reading the topic.

> 4. From the example it seems "TxnStartOffset" should be included in the
DescribeTransaction response schema? Otherwise the user would not get it in
the following WriteTxnMarker request.

The `DescribeTransaction` API is sent to the transaction coordinator, which
does not know the start offset of a transaction in each topic partition.
That is why we need `DescribeProducers`.

> 5. It is a bit easier for readers to highlight the added fields in the
existing WriteTxnMarkerRequest (btw I read is that we are only adding
"Partitions" with the starting offset, right?). Also as for its response it
seems we do not make any schema changes except adding one more potential
error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
state that explicitly.

I highlighted the new field in the request. For the response, the KIP
states the following: "There are no changes to the response schema, but it
will be bumped. Note that we are also enabling flexible version support."

> 6. It is not clear to me for the overloaded function that the following
option classes are not specified, what should be the default options?
...

I was just trying to stick with existing conventions, but I will add some
more detail here. I think we should probably still include
`AbortTransactionOptions`. The `Options` classes are how users override
timeouts.

> 7.1 Is "--broker" a required or optional (in that case I presume we would
just query all brokers iteratively) in "--find-hanging"?

I think it should be required as a reasonable way to limit the scope of the
search. This is meant to be guided by metrics after all. If we do not limit
the scope to a single broker, then the behavior might get worse as the
cluster grows. I will clarify this.

> 7.2 Seems "list-producers" is not exposed as a standalone feature in the
cmd but only used in the wrapping "--find-hanging", is that intentional?
Personally I feel exposing a "--list-producers" may be useful too: if we
believe the user has the right ACL, it is legitimate to return the producer
information to her anyways. But that is debatable in the meta point 3)
above.

Yeah, I was planning to add this to support the use case that Lucas
mentioned. There is some awkwardness since it is a little difficult to
convey different sources of information through the same command. I guess
we can do `--list producers` and `--list transactions` and explain in the
documentation. Maybe that is good enough.

> 7.3 "Describing Transactions": we should also explain how that would be
executed, e.g. at least we should clarify that we would first find the
coordinator based on the transactional.id and hence users do not need to
specify one.

Sure, makes sense.

> 7.4. In "Aborting Transactions", should we also specify the "--broker"
node
as a required option? Otherwise we would not know which broker to send to.

The --topic and --partition arguments are required, so the target is always
the leader of that partition.


Thanks,
Jason



On Fri, Aug 28, 2020 at 8:13 AM Robert Barrett <bo...@confluent.io>
wrote:

> Hi Jason,
>
> Thanks for this KIP, I think this will be a huge operational improvement
> and overall it looks great to me.
>
> I'm not sure how much value the MaxActiveTransactionDuration metric adds,
> given that we have the --find-hanging option in the tool. As you mention,
> instances of these transactions are expected to be rare, and a
> partition-level metric, which can generate a lot of data, seems very
> heavyweight for such a rare occurrence. I think "alert on
> PartitionsWithLateTransactionsCount" followed by "run kafka-transactions
> --find-hanging on the relevant broker" is a reasonable process for cluster
> operators to follow.
>
> Thanks,
> Bob
>
> On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Hi Jason,
> >
> > Thanks for the written KIP. I think this is going to be a very useful
> tool
> > for operational improvements since with eos in its current stage, we
> cannot
> > confidently assert that we are bug-free, and even in the future when we
> are
> > confident this is still going to be leveraged by older versioned brokers.
> > Regarding the solution, I've also debated myself whether Kafka should
> > "self-heal" automatically when detected in such situations, or should we
> > instead build into ecosystem tooling to let operators do it. And I've
> also
> > convinced myself that the latter should be a better solution to keep
> Kafka
> > software itself simpler.
> >
> > Regarding the KIP itself, I have a few meta comments below:
> >
> > 1. I'd like to clarify how we can make "--abort" work with old brokers,
> > since without the additional field "Partitions" the tool needs to set the
> > coordinator epoch correctly instead of "-1"? Arguably that's still doable
> > but would require different call paths, and it's not clear whether that's
> > worth doing for old versions.
> >
> > 2. Why do we have to enforce "DescribeProducers" to be sent to only
> leaders
> > while ListTransactions can be sent to any brokers? Or is it really
> > "ListTransactions to be sent to coordinators only"? From the workflow
> > you've described, based on the results back from DescribeProducers, we
> > should just immediately send ListTransactions to the
> > corresponding coordinators based on the collected producer ids, instead
> of
> > trying to send to any brokers right?
> >
> > Also I'm a bit concerned if "ListTransactions" could potentially return
> too
> > much data with "StateFilters" set to all states, including completed
> ones.
> > Do we expect users ever want to know transactions that are not pending?
> On
> > the other hand, maybe we can just require users to specify the "pids[]"
> in
> > this request too to further filter those un-interested transactions. This
> > also works well with the workflow: we know exactly from
> "DescribeProducers"
> > which pids are we diagnosing right now, so in the follow-up
> > "ListTransactions" we should also only care for those partitions only.
> >
> > 3. One thing I'm a bit hesitant about is that, is `Describe` permission
> on
> > the associated topic sufficient to allow any users to get all producer
> > information writing to the specific topic-partitions including last
> > timestamp, txn-start-timestamp etc, which may be considered sensitive?
> > Should we require "ClusterAction" to only allow operators only?
> >
> > Below are more detailed comments:
> >
> > 4. From the example it seems "TxnStartOffset" should be included in the
> > DescribeTransaction response schema? Otherwise the user would not get it
> in
> > the following WriteTxnMarker request.
> >
> > 5. It is a bit easier for readers to highlight the added fields in the
> > existing WriteTxnMarkerRequest (btw I read is that we are only adding
> > "Partitions" with the starting offset, right?). Also as for its response
> it
> > seems we do not make any schema changes except adding one more potential
> > error code "INVALID_TXN_STATE" to it, right? If that's the case we can
> just
> > state that explicitly.
> >
> > 6. It is not clear to me for the overloaded function that the following
> > option classes are not specified, what should be the default options?
> >
> > * ListTransactionsOptions
> > * DescribeTransactionsOptions
> > * DescribeProducersOptions
> >
> > Also, it seems AbortTransactionOptions would just be empty? If yes do we
> > really need this option class for now?
> >
> > 7. A couple questions from the cmd tool examples:
> > 7.1 Is "--broker" a required or optional (in that case I presume we would
> > just query all brokers iteratively) in "--find-hanging"?
> > 7.2 Seems "list-producers" is not exposed as a standalone feature in the
> > cmd but only used in the wrapping "--find-hanging", is that intentional?
> > Personally I feel exposing a "--list-producers" may be useful too: if we
> > believe the user has the right ACL, it is legitimate to return the
> producer
> > information to her anyways. But that is debatable in the meta point 3)
> > above.
> > 7.3 "Describing Transactions": we should also explain how that would be
> > executed, e.g. at least we should clarify that we would first find the
> > coordinator based on the transactional.id and hence users do not need to
> > specify one.
> > 7.4. In "Aborting Transactions", should we also specify the "--broker"
> node
> > as a required option? Otherwise we would not know which broker to send
> to.
> >
> >
> > Overall, nice written one, thanks Jason.
> >
> > Guozhang
> >
> >
> > On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > >> Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > > That's an interesting thought as well. Are you trying to avoid the
> need
> > > to
> > > specify it through the command line? The tool could also query the
> value
> > > with DescribeConfigs I suppose.
> > >
> > > Basically. I'm not sure how useful this will be in practice, though it
> > > might help when debugging.
> > >
> > > Lucas
> > >
> > > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Lucas,
> > > >
> > > > Thanks for the comments. Responses below:
> > > >
> > > > > Given that it's possible for replica producer states to diverge
> from
> > > each
> > > > other, it would be very useful if DescribeProducers(Request,Response)
> > and
> > > > tooling is able to query all partition replicas for their producers
> > > >
> > > > Yes, it makes sense to me to let DescribeProducers work on both
> > followers
> > > > and leaders. In fact, I'm encouraged that there are use cases for
> this
> > > work
> > > > other than detecting hanging transactions. That was indeed the hope,
> > but
> > > I
> > > > didn't have anything specific in mind. I will update the proposal.
> > > >
> > > > > Would it be worth returning transactional.id.expiration.ms in the
> > > > DescribeProducersResponse?
> > > >
> > > > That's an interesting thought as well. Are you trying to avoid the
> need
> > > to
> > > > specify it through the command line? The tool could also query the
> > value
> > > > with DescribeConfigs I suppose.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <
> lucas@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi Jason,
> > > > >
> > > > > This looks like a very useful tool, thanks for writing it up.
> > > > >
> > > > > Given that it's possible for replica producer states to diverge
> from
> > > each
> > > > > other, it would be very useful if
> DescribeProducers(Request,Response)
> > > and
> > > > > tooling is able to query all partition replicas for their
> producers.
> > > One
> > > > > way I can see this being used immediately is in kafka's system
> tests,
> > > > > especially the ones that inject failures. At the end of the test we
> > can
> > > > > query all replicas and make sure that their states have not
> > diverged. I
> > > > can
> > > > > also see it being useful when debugging production clusters too.
> > > > >
> > > > > Would it be worth returning transactional.id.expiration.ms in the
> > > > > DescribeProducersResponse?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Lucas
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > > >
> > > > > > Ron
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> > jason@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hey Ron,
> > > > > > >
> > > > > > > We do not typically backport new APIs to older versions. I
> think
> > we
> > > > can
> > > > > > > however make the --abort command compatible with older
> versions.
> > It
> > > > > would
> > > > > > > require a user to do some analysis on their own to identify a
> > > hanging
> > > > > > > transaction, but then they can use the tool from a new release
> to
> > > > > > recover.
> > > > > > > For example, users could detect a hanging transaction through
> the
> > > > > > existing
> > > > > > > "LastStableOffsetLag" metric and then collect the needed
> > > information
> > > > > > from a
> > > > > > > dump of the log (or producer snapshot). It's more work, but at
> > > least
> > > > > it's
> > > > > > > possible. Does that sound fair?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> > rndgstn@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > > >
> > > > > > > > Will the implementation be backported to prior Kafka
> versions?
> > > The
> > > > > > > reason
> > > > > > > > I ask is because if it is not backported and similar
> > > functionality
> > > > is
> > > > > > not
> > > > > > > > otherwise made available for older versions, then the only
> > > recourse
> > > > > > > (aside
> > > > > > > > from deleting and recreating the topic as you pointed out)
> may
> > be
> > > > to
> > > > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > > > functionality).
> > > > > > > > Such an upgrade may not be desirable, especially if the
> number
> > of
> > > > > > > > intermediate versions is considerable. I understand the
> mantra
> > of
> > > > > > "never
> > > > > > > > fall too many versions behind" but the reality of it is that
> it
> > > > isn't
> > > > > > > > always the case.  Even if the version is relatively recent,
> an
> > > > > upgrade
> > > > > > > may
> > > > > > > > still not be possible for some time, and a quicker resolution
> > may
> > > > be
> > > > > > > > necessary.
> > > > > > > >
> > > > > > > > Ron
> > > > > > > >
> > > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > > jason@confluent.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > I've added a proposal to handle the problem of hanging
> > > > > transactions:
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > > .
> > > > > > > > > In theory, this should never happen. In practice, we have
> hit
> > > one
> > > > > bug
> > > > > > > > where
> > > > > > > > > it was possible and there are few good options today to
> > > recover.
> > > > > > Take a
> > > > > > > > > look and let me know what you think.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Jason
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Robert Barrett <bo...@confluent.io>.
Hi Jason,

Thanks for this KIP, I think this will be a huge operational improvement
and overall it looks great to me.

I'm not sure how much value the MaxActiveTransactionDuration metric adds,
given that we have the --find-hanging option in the tool. As you mention,
instances of these transactions are expected to be rare, and a
partition-level metric, which can generate a lot of data, seems very
heavyweight for such a rare occurrence. I think "alert on
PartitionsWithLateTransactionsCount" followed by "run kafka-transactions
--find-hanging on the relevant broker" is a reasonable process for cluster
operators to follow.

Thanks,
Bob

On Thu, Aug 27, 2020 at 9:23 PM Guozhang Wang <wa...@gmail.com> wrote:

> Hi Jason,
>
> Thanks for the written KIP. I think this is going to be a very useful tool
> for operational improvements since with eos in its current stage, we cannot
> confidently assert that we are bug-free, and even in the future when we are
> confident this is still going to be leveraged by older versioned brokers.
> Regarding the solution, I've also debated myself whether Kafka should
> "self-heal" automatically when detected in such situations, or should we
> instead build into ecosystem tooling to let operators do it. And I've also
> convinced myself that the latter should be a better solution to keep Kafka
> software itself simpler.
>
> Regarding the KIP itself, I have a few meta comments below:
>
> 1. I'd like to clarify how we can make "--abort" work with old brokers,
> since without the additional field "Partitions" the tool needs to set the
> coordinator epoch correctly instead of "-1"? Arguably that's still doable
> but would require different call paths, and it's not clear whether that's
> worth doing for old versions.
>
> 2. Why do we have to enforce "DescribeProducers" to be sent to only leaders
> while ListTransactions can be sent to any brokers? Or is it really
> "ListTransactions to be sent to coordinators only"? From the workflow
> you've described, based on the results back from DescribeProducers, we
> should just immediately send ListTransactions to the
> corresponding coordinators based on the collected producer ids, instead of
> trying to send to any brokers right?
>
> Also I'm a bit concerned if "ListTransactions" could potentially return too
> much data with "StateFilters" set to all states, including completed ones.
> Do we expect users ever want to know transactions that are not pending? On
> the other hand, maybe we can just require users to specify the "pids[]" in
> this request too to further filter those un-interested transactions. This
> also works well with the workflow: we know exactly from "DescribeProducers"
> which pids are we diagnosing right now, so in the follow-up
> "ListTransactions" we should also only care for those partitions only.
>
> 3. One thing I'm a bit hesitant about is that, is `Describe` permission on
> the associated topic sufficient to allow any users to get all producer
> information writing to the specific topic-partitions including last
> timestamp, txn-start-timestamp etc, which may be considered sensitive?
> Should we require "ClusterAction" to only allow operators only?
>
> Below are more detailed comments:
>
> 4. From the example it seems "TxnStartOffset" should be included in the
> DescribeTransaction response schema? Otherwise the user would not get it in
> the following WriteTxnMarker request.
>
> 5. It is a bit easier for readers to highlight the added fields in the
> existing WriteTxnMarkerRequest (btw I read is that we are only adding
> "Partitions" with the starting offset, right?). Also as for its response it
> seems we do not make any schema changes except adding one more potential
> error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
> state that explicitly.
>
> 6. It is not clear to me for the overloaded function that the following
> option classes are not specified, what should be the default options?
>
> * ListTransactionsOptions
> * DescribeTransactionsOptions
> * DescribeProducersOptions
>
> Also, it seems AbortTransactionOptions would just be empty? If yes do we
> really need this option class for now?
>
> 7. A couple questions from the cmd tool examples:
> 7.1 Is "--broker" a required or optional (in that case I presume we would
> just query all brokers iteratively) in "--find-hanging"?
> 7.2 Seems "list-producers" is not exposed as a standalone feature in the
> cmd but only used in the wrapping "--find-hanging", is that intentional?
> Personally I feel exposing a "--list-producers" may be useful too: if we
> believe the user has the right ACL, it is legitimate to return the producer
> information to her anyways. But that is debatable in the meta point 3)
> above.
> 7.3 "Describing Transactions": we should also explain how that would be
> executed, e.g. at least we should clarify that we would first find the
> coordinator based on the transactional.id and hence users do not need to
> specify one.
> 7.4. In "Aborting Transactions", should we also specify the "--broker" node
> as a required option? Otherwise we would not know which broker to send to.
>
>
> Overall, nice written one, thanks Jason.
>
> Guozhang
>
>
> On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
> wrote:
>
> > >> Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > > That's an interesting thought as well. Are you trying to avoid the need
> > to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Basically. I'm not sure how useful this will be in practice, though it
> > might help when debugging.
> >
> > Lucas
> >
> > On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Lucas,
> > >
> > > Thanks for the comments. Responses below:
> > >
> > > > Given that it's possible for replica producer states to diverge from
> > each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers
> > >
> > > Yes, it makes sense to me to let DescribeProducers work on both
> followers
> > > and leaders. In fact, I'm encouraged that there are use cases for this
> > work
> > > other than detecting hanging transactions. That was indeed the hope,
> but
> > I
> > > didn't have anything specific in mind. I will update the proposal.
> > >
> > > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > That's an interesting thought as well. Are you trying to avoid the need
> > to
> > > specify it through the command line? The tool could also query the
> value
> > > with DescribeConfigs I suppose.
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > > wrote:
> > >
> > > > Hi Jason,
> > > >
> > > > This looks like a very useful tool, thanks for writing it up.
> > > >
> > > > Given that it's possible for replica producer states to diverge from
> > each
> > > > other, it would be very useful if DescribeProducers(Request,Response)
> > and
> > > > tooling is able to query all partition replicas for their producers.
> > One
> > > > way I can see this being used immediately is in kafka's system tests,
> > > > especially the ones that inject failures. At the end of the test we
> can
> > > > query all replicas and make sure that their states have not
> diverged. I
> > > can
> > > > also see it being useful when debugging production clusters too.
> > > >
> > > > Would it be worth returning transactional.id.expiration.ms in the
> > > > DescribeProducersResponse?
> > > >
> > > > Cheers,
> > > >
> > > > Lucas
> > > >
> > > >
> > > >
> > > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com>
> > > wrote:
> > > >
> > > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > > >
> > > > > Ron
> > > > >
> > > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hey Ron,
> > > > > >
> > > > > > We do not typically backport new APIs to older versions. I think
> we
> > > can
> > > > > > however make the --abort command compatible with older versions.
> It
> > > > would
> > > > > > require a user to do some analysis on their own to identify a
> > hanging
> > > > > > transaction, but then they can use the tool from a new release to
> > > > > recover.
> > > > > > For example, users could detect a hanging transaction through the
> > > > > existing
> > > > > > "LastStableOffsetLag" metric and then collect the needed
> > information
> > > > > from a
> > > > > > dump of the log (or producer snapshot). It's more work, but at
> > least
> > > > it's
> > > > > > possible. Does that sound fair?
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <
> rndgstn@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > > >
> > > > > > > Will the implementation be backported to prior Kafka versions?
> > The
> > > > > > reason
> > > > > > > I ask is because if it is not backported and similar
> > functionality
> > > is
> > > > > not
> > > > > > > otherwise made available for older versions, then the only
> > recourse
> > > > > > (aside
> > > > > > > from deleting and recreating the topic as you pointed out) may
> be
> > > to
> > > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > > functionality).
> > > > > > > Such an upgrade may not be desirable, especially if the number
> of
> > > > > > > intermediate versions is considerable. I understand the mantra
> of
> > > > > "never
> > > > > > > fall too many versions behind" but the reality of it is that it
> > > isn't
> > > > > > > always the case.  Even if the version is relatively recent, an
> > > > upgrade
> > > > > > may
> > > > > > > still not be possible for some time, and a quicker resolution
> may
> > > be
> > > > > > > necessary.
> > > > > > >
> > > > > > > Ron
> > > > > > >
> > > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > > jason@confluent.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > I've added a proposal to handle the problem of hanging
> > > > transactions:
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > > .
> > > > > > > > In theory, this should never happen. In practice, we have hit
> > one
> > > > bug
> > > > > > > where
> > > > > > > > it was possible and there are few good options today to
> > recover.
> > > > > Take a
> > > > > > > > look and let me know what you think.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jason
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Guozhang Wang <wa...@gmail.com>.
Hi Jason,

Thanks for the written KIP. I think this is going to be a very useful tool
for operational improvements since with eos in its current stage, we cannot
confidently assert that we are bug-free, and even in the future when we are
confident this is still going to be leveraged by older versioned brokers.
Regarding the solution, I've also debated myself whether Kafka should
"self-heal" automatically when detected in such situations, or should we
instead build into ecosystem tooling to let operators do it. And I've also
convinced myself that the latter should be a better solution to keep Kafka
software itself simpler.

Regarding the KIP itself, I have a few meta comments below:

1. I'd like to clarify how we can make "--abort" work with old brokers,
since without the additional field "Partitions" the tool needs to set the
coordinator epoch correctly instead of "-1"? Arguably that's still doable
but would require different call paths, and it's not clear whether that's
worth doing for old versions.

2. Why do we have to enforce "DescribeProducers" to be sent to only leaders
while ListTransactions can be sent to any brokers? Or is it really
"ListTransactions to be sent to coordinators only"? From the workflow
you've described, based on the results back from DescribeProducers, we
should just immediately send ListTransactions to the
corresponding coordinators based on the collected producer ids, instead of
trying to send to any brokers right?

Also I'm a bit concerned if "ListTransactions" could potentially return too
much data with "StateFilters" set to all states, including completed ones.
Do we expect users ever want to know transactions that are not pending? On
the other hand, maybe we can just require users to specify the "pids[]" in
this request too to further filter those un-interested transactions. This
also works well with the workflow: we know exactly from "DescribeProducers"
which pids are we diagnosing right now, so in the follow-up
"ListTransactions" we should also only care for those partitions only.

3. One thing I'm a bit hesitant about is that, is `Describe` permission on
the associated topic sufficient to allow any users to get all producer
information writing to the specific topic-partitions including last
timestamp, txn-start-timestamp etc, which may be considered sensitive?
Should we require "ClusterAction" to only allow operators only?

Below are more detailed comments:

4. From the example it seems "TxnStartOffset" should be included in the
DescribeTransaction response schema? Otherwise the user would not get it in
the following WriteTxnMarker request.

5. It is a bit easier for readers to highlight the added fields in the
existing WriteTxnMarkerRequest (btw I read is that we are only adding
"Partitions" with the starting offset, right?). Also as for its response it
seems we do not make any schema changes except adding one more potential
error code "INVALID_TXN_STATE" to it, right? If that's the case we can just
state that explicitly.

6. It is not clear to me for the overloaded function that the following
option classes are not specified, what should be the default options?

* ListTransactionsOptions
* DescribeTransactionsOptions
* DescribeProducersOptions

Also, it seems AbortTransactionOptions would just be empty? If yes do we
really need this option class for now?

7. A couple questions from the cmd tool examples:
7.1 Is "--broker" a required or optional (in that case I presume we would
just query all brokers iteratively) in "--find-hanging"?
7.2 Seems "list-producers" is not exposed as a standalone feature in the
cmd but only used in the wrapping "--find-hanging", is that intentional?
Personally I feel exposing a "--list-producers" may be useful too: if we
believe the user has the right ACL, it is legitimate to return the producer
information to her anyways. But that is debatable in the meta point 3)
above.
7.3 "Describing Transactions": we should also explain how that would be
executed, e.g. at least we should clarify that we would first find the
coordinator based on the transactional.id and hence users do not need to
specify one.
7.4. In "Aborting Transactions", should we also specify the "--broker" node
as a required option? Otherwise we would not know which broker to send to.


Overall, nice written one, thanks Jason.

Guozhang


On Thu, Aug 27, 2020 at 11:44 AM Lucas Bradstreet <lu...@confluent.io>
wrote:

> >> Would it be worth returning transactional.id.expiration.ms in the
> DescribeProducersResponse?
>
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> specify it through the command line? The tool could also query the value
> with DescribeConfigs I suppose.
>
> Basically. I'm not sure how useful this will be in practice, though it
> might help when debugging.
>
> Lucas
>
> On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Lucas,
> >
> > Thanks for the comments. Responses below:
> >
> > > Given that it's possible for replica producer states to diverge from
> each
> > other, it would be very useful if DescribeProducers(Request,Response) and
> > tooling is able to query all partition replicas for their producers
> >
> > Yes, it makes sense to me to let DescribeProducers work on both followers
> > and leaders. In fact, I'm encouraged that there are use cases for this
> work
> > other than detecting hanging transactions. That was indeed the hope, but
> I
> > didn't have anything specific in mind. I will update the proposal.
> >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > That's an interesting thought as well. Are you trying to avoid the need
> to
> > specify it through the command line? The tool could also query the value
> > with DescribeConfigs I suppose.
> >
> > Thanks,
> > Jason
> >
> > On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> > wrote:
> >
> > > Hi Jason,
> > >
> > > This looks like a very useful tool, thanks for writing it up.
> > >
> > > Given that it's possible for replica producer states to diverge from
> each
> > > other, it would be very useful if DescribeProducers(Request,Response)
> and
> > > tooling is able to query all partition replicas for their producers.
> One
> > > way I can see this being used immediately is in kafka's system tests,
> > > especially the ones that inject failures. At the end of the test we can
> > > query all replicas and make sure that their states have not diverged. I
> > can
> > > also see it being useful when debugging production clusters too.
> > >
> > > Would it be worth returning transactional.id.expiration.ms in the
> > > DescribeProducersResponse?
> > >
> > > Cheers,
> > >
> > > Lucas
> > >
> > >
> > >
> > > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com>
> > wrote:
> > >
> > > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > > >
> > > > Ron
> > > >
> > > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hey Ron,
> > > > >
> > > > > We do not typically backport new APIs to older versions. I think we
> > can
> > > > > however make the --abort command compatible with older versions. It
> > > would
> > > > > require a user to do some analysis on their own to identify a
> hanging
> > > > > transaction, but then they can use the tool from a new release to
> > > > recover.
> > > > > For example, users could detect a hanging transaction through the
> > > > existing
> > > > > "LastStableOffsetLag" metric and then collect the needed
> information
> > > > from a
> > > > > dump of the log (or producer snapshot). It's more work, but at
> least
> > > it's
> > > > > possible. Does that sound fair?
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > > >
> > > > > > Will the implementation be backported to prior Kafka versions?
> The
> > > > > reason
> > > > > > I ask is because if it is not backported and similar
> functionality
> > is
> > > > not
> > > > > > otherwise made available for older versions, then the only
> recourse
> > > > > (aside
> > > > > > from deleting and recreating the topic as you pointed out) may be
> > to
> > > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > > functionality).
> > > > > > Such an upgrade may not be desirable, especially if the number of
> > > > > > intermediate versions is considerable. I understand the mantra of
> > > > "never
> > > > > > fall too many versions behind" but the reality of it is that it
> > isn't
> > > > > > always the case.  Even if the version is relatively recent, an
> > > upgrade
> > > > > may
> > > > > > still not be possible for some time, and a quicker resolution may
> > be
> > > > > > necessary.
> > > > > >
> > > > > > Ron
> > > > > >
> > > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> > jason@confluent.io>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi All,
> > > > > > >
> > > > > > > I've added a proposal to handle the problem of hanging
> > > transactions:
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > > .
> > > > > > > In theory, this should never happen. In practice, we have hit
> one
> > > bug
> > > > > > where
> > > > > > > it was possible and there are few good options today to
> recover.
> > > > Take a
> > > > > > > look and let me know what you think.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Jason
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Lucas Bradstreet <lu...@confluent.io>.
>> Would it be worth returning transactional.id.expiration.ms in the
DescribeProducersResponse?

> That's an interesting thought as well. Are you trying to avoid the need to
specify it through the command line? The tool could also query the value
with DescribeConfigs I suppose.

Basically. I'm not sure how useful this will be in practice, though it
might help when debugging.

Lucas

On Thu, Aug 27, 2020 at 11:00 AM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Lucas,
>
> Thanks for the comments. Responses below:
>
> > Given that it's possible for replica producer states to diverge from each
> other, it would be very useful if DescribeProducers(Request,Response) and
> tooling is able to query all partition replicas for their producers
>
> Yes, it makes sense to me to let DescribeProducers work on both followers
> and leaders. In fact, I'm encouraged that there are use cases for this work
> other than detecting hanging transactions. That was indeed the hope, but I
> didn't have anything specific in mind. I will update the proposal.
>
> > Would it be worth returning transactional.id.expiration.ms in the
> DescribeProducersResponse?
>
> That's an interesting thought as well. Are you trying to avoid the need to
> specify it through the command line? The tool could also query the value
> with DescribeConfigs I suppose.
>
> Thanks,
> Jason
>
> On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
> wrote:
>
> > Hi Jason,
> >
> > This looks like a very useful tool, thanks for writing it up.
> >
> > Given that it's possible for replica producer states to diverge from each
> > other, it would be very useful if DescribeProducers(Request,Response) and
> > tooling is able to query all partition replicas for their producers. One
> > way I can see this being used immediately is in kafka's system tests,
> > especially the ones that inject failures. At the end of the test we can
> > query all replicas and make sure that their states have not diverged. I
> can
> > also see it being useful when debugging production clusters too.
> >
> > Would it be worth returning transactional.id.expiration.ms in the
> > DescribeProducersResponse?
> >
> > Cheers,
> >
> > Lucas
> >
> >
> >
> > On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com>
> wrote:
> >
> > > Yes, that definitely sounds reasonable.  Thanks, Jason!
> > >
> > > Ron
> > >
> > > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hey Ron,
> > > >
> > > > We do not typically backport new APIs to older versions. I think we
> can
> > > > however make the --abort command compatible with older versions. It
> > would
> > > > require a user to do some analysis on their own to identify a hanging
> > > > transaction, but then they can use the tool from a new release to
> > > recover.
> > > > For example, users could detect a hanging transaction through the
> > > existing
> > > > "LastStableOffsetLag" metric and then collect the needed information
> > > from a
> > > > dump of the log (or producer snapshot). It's more work, but at least
> > it's
> > > > possible. Does that sound fair?
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > > >
> > > > > Will the implementation be backported to prior Kafka versions?  The
> > > > reason
> > > > > I ask is because if it is not backported and similar functionality
> is
> > > not
> > > > > otherwise made available for older versions, then the only recourse
> > > > (aside
> > > > > from deleting and recreating the topic as you pointed out) may be
> to
> > > > > upgrade to 2.7 (or whatever version ends up getting this
> > > functionality).
> > > > > Such an upgrade may not be desirable, especially if the number of
> > > > > intermediate versions is considerable. I understand the mantra of
> > > "never
> > > > > fall too many versions behind" but the reality of it is that it
> isn't
> > > > > always the case.  Even if the version is relatively recent, an
> > upgrade
> > > > may
> > > > > still not be possible for some time, and a quicker resolution may
> be
> > > > > necessary.
> > > > >
> > > > > Ron
> > > > >
> > > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <
> jason@confluent.io>
> > > > > wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I've added a proposal to handle the problem of hanging
> > transactions:
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > > .
> > > > > > In theory, this should never happen. In practice, we have hit one
> > bug
> > > > > where
> > > > > > it was possible and there are few good options today to recover.
> > > Take a
> > > > > > look and let me know what you think.
> > > > > >
> > > > > > Thanks,
> > > > > > Jason
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Lucas,

Thanks for the comments. Responses below:

> Given that it's possible for replica producer states to diverge from each
other, it would be very useful if DescribeProducers(Request,Response) and
tooling is able to query all partition replicas for their producers

Yes, it makes sense to me to let DescribeProducers work on both followers
and leaders. In fact, I'm encouraged that there are use cases for this work
other than detecting hanging transactions. That was indeed the hope, but I
didn't have anything specific in mind. I will update the proposal.

> Would it be worth returning transactional.id.expiration.ms in the
DescribeProducersResponse?

That's an interesting thought as well. Are you trying to avoid the need to
specify it through the command line? The tool could also query the value
with DescribeConfigs I suppose.

Thanks,
Jason

On Thu, Aug 27, 2020 at 10:48 AM Lucas Bradstreet <lu...@confluent.io>
wrote:

> Hi Jason,
>
> This looks like a very useful tool, thanks for writing it up.
>
> Given that it's possible for replica producer states to diverge from each
> other, it would be very useful if DescribeProducers(Request,Response) and
> tooling is able to query all partition replicas for their producers. One
> way I can see this being used immediately is in kafka's system tests,
> especially the ones that inject failures. At the end of the test we can
> query all replicas and make sure that their states have not diverged. I can
> also see it being useful when debugging production clusters too.
>
> Would it be worth returning transactional.id.expiration.ms in the
> DescribeProducersResponse?
>
> Cheers,
>
> Lucas
>
>
>
> On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com> wrote:
>
> > Yes, that definitely sounds reasonable.  Thanks, Jason!
> >
> > Ron
> >
> > On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hey Ron,
> > >
> > > We do not typically backport new APIs to older versions. I think we can
> > > however make the --abort command compatible with older versions. It
> would
> > > require a user to do some analysis on their own to identify a hanging
> > > transaction, but then they can use the tool from a new release to
> > recover.
> > > For example, users could detect a hanging transaction through the
> > existing
> > > "LastStableOffsetLag" metric and then collect the needed information
> > from a
> > > dump of the log (or producer snapshot). It's more work, but at least
> it's
> > > possible. Does that sound fair?
> > >
> > > Thanks,
> > > Jason
> > >
> > > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com>
> > wrote:
> > >
> > > > Hi Jason.  Thanks for the excellently-written KIP.
> > > >
> > > > Will the implementation be backported to prior Kafka versions?  The
> > > reason
> > > > I ask is because if it is not backported and similar functionality is
> > not
> > > > otherwise made available for older versions, then the only recourse
> > > (aside
> > > > from deleting and recreating the topic as you pointed out) may be to
> > > > upgrade to 2.7 (or whatever version ends up getting this
> > functionality).
> > > > Such an upgrade may not be desirable, especially if the number of
> > > > intermediate versions is considerable. I understand the mantra of
> > "never
> > > > fall too many versions behind" but the reality of it is that it isn't
> > > > always the case.  Even if the version is relatively recent, an
> upgrade
> > > may
> > > > still not be possible for some time, and a quicker resolution may be
> > > > necessary.
> > > >
> > > > Ron
> > > >
> > > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <ja...@confluent.io>
> > > > wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > I've added a proposal to handle the problem of hanging
> transactions:
> > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > > .
> > > > > In theory, this should never happen. In practice, we have hit one
> bug
> > > > where
> > > > > it was possible and there are few good options today to recover.
> > Take a
> > > > > look and let me know what you think.
> > > > >
> > > > > Thanks,
> > > > > Jason
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Lucas Bradstreet <lu...@confluent.io>.
Hi Jason,

This looks like a very useful tool, thanks for writing it up.

Given that it's possible for replica producer states to diverge from each
other, it would be very useful if DescribeProducers(Request,Response) and
tooling is able to query all partition replicas for their producers. One
way I can see this being used immediately is in kafka's system tests,
especially the ones that inject failures. At the end of the test we can
query all replicas and make sure that their states have not diverged. I can
also see it being useful when debugging production clusters too.

Would it be worth returning transactional.id.expiration.ms in the
DescribeProducersResponse?

Cheers,

Lucas



On Wed, Aug 26, 2020 at 12:12 PM Ron Dagostino <rn...@gmail.com> wrote:

> Yes, that definitely sounds reasonable.  Thanks, Jason!
>
> Ron
>
> On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hey Ron,
> >
> > We do not typically backport new APIs to older versions. I think we can
> > however make the --abort command compatible with older versions. It would
> > require a user to do some analysis on their own to identify a hanging
> > transaction, but then they can use the tool from a new release to
> recover.
> > For example, users could detect a hanging transaction through the
> existing
> > "LastStableOffsetLag" metric and then collect the needed information
> from a
> > dump of the log (or producer snapshot). It's more work, but at least it's
> > possible. Does that sound fair?
> >
> > Thanks,
> > Jason
> >
> > On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com>
> wrote:
> >
> > > Hi Jason.  Thanks for the excellently-written KIP.
> > >
> > > Will the implementation be backported to prior Kafka versions?  The
> > reason
> > > I ask is because if it is not backported and similar functionality is
> not
> > > otherwise made available for older versions, then the only recourse
> > (aside
> > > from deleting and recreating the topic as you pointed out) may be to
> > > upgrade to 2.7 (or whatever version ends up getting this
> functionality).
> > > Such an upgrade may not be desirable, especially if the number of
> > > intermediate versions is considerable. I understand the mantra of
> "never
> > > fall too many versions behind" but the reality of it is that it isn't
> > > always the case.  Even if the version is relatively recent, an upgrade
> > may
> > > still not be possible for some time, and a quicker resolution may be
> > > necessary.
> > >
> > > Ron
> > >
> > > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I've added a proposal to handle the problem of hanging transactions:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > > .
> > > > In theory, this should never happen. In practice, we have hit one bug
> > > where
> > > > it was possible and there are few good options today to recover.
> Take a
> > > > look and let me know what you think.
> > > >
> > > > Thanks,
> > > > Jason
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Ron Dagostino <rn...@gmail.com>.
Yes, that definitely sounds reasonable.  Thanks, Jason!

Ron

On Wed, Aug 26, 2020 at 3:03 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hey Ron,
>
> We do not typically backport new APIs to older versions. I think we can
> however make the --abort command compatible with older versions. It would
> require a user to do some analysis on their own to identify a hanging
> transaction, but then they can use the tool from a new release to recover.
> For example, users could detect a hanging transaction through the existing
> "LastStableOffsetLag" metric and then collect the needed information from a
> dump of the log (or producer snapshot). It's more work, but at least it's
> possible. Does that sound fair?
>
> Thanks,
> Jason
>
> On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com> wrote:
>
> > Hi Jason.  Thanks for the excellently-written KIP.
> >
> > Will the implementation be backported to prior Kafka versions?  The
> reason
> > I ask is because if it is not backported and similar functionality is not
> > otherwise made available for older versions, then the only recourse
> (aside
> > from deleting and recreating the topic as you pointed out) may be to
> > upgrade to 2.7 (or whatever version ends up getting this functionality).
> > Such an upgrade may not be desirable, especially if the number of
> > intermediate versions is considerable. I understand the mantra of "never
> > fall too many versions behind" but the reality of it is that it isn't
> > always the case.  Even if the version is relatively recent, an upgrade
> may
> > still not be possible for some time, and a quicker resolution may be
> > necessary.
> >
> > Ron
> >
> > On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <ja...@confluent.io>
> > wrote:
> >
> > > Hi All,
> > >
> > > I've added a proposal to handle the problem of hanging transactions:
> > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > > .
> > > In theory, this should never happen. In practice, we have hit one bug
> > where
> > > it was possible and there are few good options today to recover. Take a
> > > look and let me know what you think.
> > >
> > > Thanks,
> > > Jason
> > >
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Jason Gustafson <ja...@confluent.io>.
Hey Ron,

We do not typically backport new APIs to older versions. I think we can
however make the --abort command compatible with older versions. It would
require a user to do some analysis on their own to identify a hanging
transaction, but then they can use the tool from a new release to recover.
For example, users could detect a hanging transaction through the existing
"LastStableOffsetLag" metric and then collect the needed information from a
dump of the log (or producer snapshot). It's more work, but at least it's
possible. Does that sound fair?

Thanks,
Jason

On Wed, Aug 26, 2020 at 11:51 AM Ron Dagostino <rn...@gmail.com> wrote:

> Hi Jason.  Thanks for the excellently-written KIP.
>
> Will the implementation be backported to prior Kafka versions?  The reason
> I ask is because if it is not backported and similar functionality is not
> otherwise made available for older versions, then the only recourse (aside
> from deleting and recreating the topic as you pointed out) may be to
> upgrade to 2.7 (or whatever version ends up getting this functionality).
> Such an upgrade may not be desirable, especially if the number of
> intermediate versions is considerable. I understand the mantra of "never
> fall too many versions behind" but the reality of it is that it isn't
> always the case.  Even if the version is relatively recent, an upgrade may
> still not be possible for some time, and a quicker resolution may be
> necessary.
>
> Ron
>
> On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <ja...@confluent.io>
> wrote:
>
> > Hi All,
> >
> > I've added a proposal to handle the problem of hanging transactions:
> >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> > .
> > In theory, this should never happen. In practice, we have hit one bug
> where
> > it was possible and there are few good options today to recover. Take a
> > look and let me know what you think.
> >
> > Thanks,
> > Jason
> >
>

Re: [DISCUSS] KIP-664: Provide tooling to detect and abort hanging transactions

Posted by Ron Dagostino <rn...@gmail.com>.
Hi Jason.  Thanks for the excellently-written KIP.

Will the implementation be backported to prior Kafka versions?  The reason
I ask is because if it is not backported and similar functionality is not
otherwise made available for older versions, then the only recourse (aside
from deleting and recreating the topic as you pointed out) may be to
upgrade to 2.7 (or whatever version ends up getting this functionality).
Such an upgrade may not be desirable, especially if the number of
intermediate versions is considerable. I understand the mantra of "never
fall too many versions behind" but the reality of it is that it isn't
always the case.  Even if the version is relatively recent, an upgrade may
still not be possible for some time, and a quicker resolution may be
necessary.

Ron

On Wed, Aug 26, 2020 at 2:33 PM Jason Gustafson <ja...@confluent.io> wrote:

> Hi All,
>
> I've added a proposal to handle the problem of hanging transactions:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-664%3A+Provide+tooling+to+detect+and+abort+hanging+transactions
> .
> In theory, this should never happen. In practice, we have hit one bug where
> it was possible and there are few good options today to recover. Take a
> look and let me know what you think.
>
> Thanks,
> Jason
>