Posted to dev@pulsar.apache.org by Yubiao Feng <yu...@streamnative.io.INVALID> on 2023/02/12 12:02:34 UTC

Re: [DISCUSS] PIP-240 A new API to unload subscriptions

I have started the voting process for this PIP.

Thanks
Yubiao

On Thu, Jan 19, 2023 at 5:55 PM Haiting Jiang <ji...@gmail.com>
wrote:

> I agree with Penghui & Xiaolong,
>
> 1. Restarting a service is usually the most common and effective
> option for service maintainers to recover a service and minimize
> business loss.
> With subscription unloading, we can reduce the impact significantly,
> because unloading a topic also affects message writing, which has a
> much bigger effect on online business.
>
> 2. Having subscription unloading doesn't conflict with solving the real
> issue. Like a broker restart, it just buys us more time to locate
> the real problem.
>
> BR,
> Haiting
>
> On Thu, Jan 19, 2023 at 11:42 AM rxl@apache.org
> <ra...@gmail.com> wrote:
> >
> > Hello Joe and Enrico:
> >
> > I agree with what you've been emphasizing: we need to fix these issues
> > at the root cause. While maintaining the Go SDK, we have encountered
> > many stuck-consumer problems since version 0.4.0. Some were logic errors
> > in the Go SDK itself, and some were caused by users misusing the Go SDK.
> > By the 0.8.0 release, the Go SDK was in use on a large scale in our
> > environment. Across these releases we have been trying to fix these bugs
> > completely. This is what our maintainers have been working hard on, and
> > it is also the final state we expect for Pulsar: everything looks OK.
> >
> > However, while the Go SDK moved from 0.4.0 to 0.8.0, users in our
> > production environment hit similar problems many times. So consider
> > again a production user whose consumption is blocked. The user comes to
> > us. Do we use some means to quickly let the consumers continue to consume
> > messages? Or do we keep the user stuck in production until we find the
> > root cause, fix it, and push the user to upgrade? I think everyone's
> > answer tends to be the former. We will not expose the hacky unload-topic
> > and unload-subscription operations directly to users, but to Pulsar's
> > operations and maintenance personnel, so this is more of an operations
> > tool than an interface called by users. So I think the impact is
> > controllable for Pulsar as a whole, which is why I support it.
> >
> > Again, this PIP is about buying us more time to locate the problem
> > while minimizing the impact on production users. Having this interface
> > does not mean we stop looking for the real cause of the stuck state. On
> > the contrary, we are making a trade-off between user impact and
> > troubleshooting, which buys us more time for the latter.
> >
> > --
> > Thanks
> > xiaolong ran
> >
> > On Wed, Jan 18, 2023 at 11:48, PengHui Li <pe...@apache.org> wrote:
> >
> > > > What kind of problems is this trying to fix?
> > > > And why cannot that be solved by client-side fixes?
> > >
> > > Yes, most of the issues come from the client side, rarely from the broker.
> > > But the application owner also needs time to fix the issue, release the
> > > fix, and deploy it to the production environment. Unloading the
> > > subscription is just a temporary way to mitigate the issue and reduce the
> > > impact. It will not fix the issue completely.
> > >
> > > What I have learned to do is capture the heap dump, topic stats, internal
> > > stats, and logs from the broker and client, and then try unloading the
> > > topic to see whether the problem is mitigated. If not, try restarting the
> > > broker or the client. Most of the time, the problem can be mitigated this
> > > way. Then we can continue to reproduce the issue and investigate it
> > > from the captured heap dump and logs.
> > >
> > > > In shared sub issues, it's hard to pinpoint which consumer/where
> > > > the problem lies, and to reset that one at the client. The totality of
> > > > state spread between the brokers and all the consumers of the shared sub
> > > > needs to be put together. Is that why we are doing this?
> > >
> > > From my experience, most come from Shared and Key_Shared subscriptions.
> > > Most of the issues come from misuse, rarely from bugs in brokers or
> > > clients.
> > >
> > > Regards,
> > > Penghui
> > >
> > >
> > > On Wed, Jan 18, 2023 at 11:31 AM Joe F <jo...@gmail.com> wrote:
> > >
> > > > Inclined to agree with Enrico. If it's a hard problem, it will repeat,
> > > > and this is not helping. If it's some race on the client, it will occur
> > > > randomly and rarely, and this unload sub will get programmed in as a way
> > > > of life.
> > > >
> > > > > If you think unloading the subscription can't help anything,
> > > > > unloading the topic should be the same. From my experience, most
> > > > > topic unloading operations are done to mitigate problems related to
> > > > > message consumption.
> > > >
> > > > Comparisons with unloading a topic are not the bar here, as that is a
> > > > first-class broker utility that is needed for operational reasons outside
> > > > of "fixing" consumer-side issues. The side effect of using "unload topic"
> > > > is a loss of transient topic state. I will fully agree that this side
> > > > effect has been pervasively abused for fixing problems (à la
> > > > Ctrl-Alt-Del), but that's not the rationale for having an unload topic
> > > > utility.
> > > >
> > > > What kind of problems is this trying to fix?
> > > > And why cannot that be solved by client-side fixes?
> > > >
> > > > In shared sub issues, it's hard to pinpoint which consumer/where
> > > > the problem lies, and to reset that one at the client. The totality of
> > > > state spread between the brokers and all the consumers of the shared sub
> > > > needs to be put together. Is that why we are doing this?
> > > >
> > > >
> > > > On Tue, Jan 17, 2023 at 5:30 PM PengHui Li <pe...@apache.org> wrote:
> > > >
> > > > > I agree that if we encounter a stuck consumption issue, we should
> > > > > continue to find the root cause of the problem.
> > > > >
> > > > > Subscription unloading is just an option to mitigate the impact first.
> > > > > Sometimes it may mitigate the issue for only an hour, especially with
> > > > > key_shared subscriptions. Sometimes it's not a bug in Pulsar,
> > > > > but users need time to fix the issue, and it doesn't make sense to let
> > > > > the impact continue until the fix is applied.
> > > > >
> > > > > I have also helped many people troubleshoot stuck consumption
> > > > > issues related to key_shared subscriptions, transactions, etc.
> > > > > In most cases, unloading the topic can mitigate the impact.
> > > > > For example, due to an uncaught exception, the dispatch thread
> > > > > stopped reading messages from the managed ledger. The exception
> > > > > occurs very infrequently. Unloading the topic is the best choice
> > > > > for now, right?
> > > > >
> > > > > If you think unloading the subscription can't help anything,
> > > > > unloading the topic should be the same. From my experience, most
> > > > > topic unloading operations are done to mitigate problems related to
> > > > > message consumption.
> > > > >
> > > > > Best,
> > > > > Penghui
> > > > >
> > > > > On Tue, Jan 17, 2023 at 11:09 PM Enrico Olivelli <eolivelli@gmail.com> wrote:
> > > > >
> > > > > > On Mon, Jan 16, 2023 at 11:58 rxl@apache.org <ra...@gmail.com> wrote:
> > > > > > >
> > > > > > > I agree with @Enrico and @Bo: if we encounter a stuck subscription,
> > > > > > > we must keep spending time to locate and fix the problem, which is
> > > > > > > what we have been doing.
> > > > > > >
> > > > > > > But let's think about this problem from another angle. Suppose a user
> > > > > > > in the production environment encounters a stuck consumer. What
> > > > > > > should we do? For a production user, our first reaction to a problem
> > > > > > > is how to recover quickly and how to reduce the user's losses quickly.
> > > > > > > At that point we do not even consider whether it is a bug on the
> > > > > > > broker side, a bug on the SDK side, or a mistake in how the user uses
> > > > > > > the client. In the process of fast recovery, our most common method is
> > > > > > > to quickly re-establish the connection between the broker and the
> > > > > > > client by unloading the specified topic. In this process, we try to
> > > > > > > retain as much context as possible to help us keep locating and
> > > > > > > fixing the problem afterwards.
> > > > > > >
> > > > > > > So I don't think these two things conflict. The reason we expose the
> > > > > > > unload topic command in the admin CLI is the same reason we want to
> > > > > > > expose unload subscription. From a developer's perspective, we
> > > > > > > definitely want to completely fix the problem that caused the stuck
> > > > > > > state. From the user's perspective, when something like a stuck
> > > > > > > consumer occurs, the user does not care about the specific cause of
> > > > > > > the problem, but expects the business to recover in the shortest
> > > > > > > possible time to avoid further loss.
> > > > > > >
> > > > > > > I admit that this is a relatively hacky way, but it can indeed solve
> > > > > > > the problems we are currently encountering, and at the same time it
> > > > > > > will not cause a major conflict with Pulsar's existing logic. So I
> > > > > > > still agree with Yubiao's point of view.
> > > > > >
> > > > > >
> > > > > >
> > > > > > Usually when a subscription is "stuck", even if you unload the topic
> > > > > > it returns to the "stuck" state again if you don't solve the problem.
> > > > > >
> > > > > > This is a very common issue with Pulsar users. I spend a lot of time
> > > > > > helping users troubleshoot their production problems, and unloading
> > > > > > the topic is never a solution: it can give you seconds, minutes, or
> > > > > > hours of "working state", and then the problem will happen again.
> > > > > >
> > > > > > You say that it can solve the problems you are encountering.
> > > > > > Could you please give more context? (In Slack, if this is not
> > > > > > something that can be discussed in public.)
> > > > > > I apologise if I seem too much of a skeptic this time. I am sure that
> > > > > > you have a real problem and you want to fix it, but I would like to
> > > > > > help you find the best way.
> > > > > >
> > > > > > Pulsar is used by many people and we shouldn't add hacky tools for
> > > > > > temporary workarounds.
> > > > > > Once we deliver an API we have to maintain it indefinitely.
> > > > > >
> > > > > > You could patch your system and use the patched version temporarily
> > > > > > until you find the root cause.
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Thanks
> > > > > > > Xiaolong Ran
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Jan 15, 2023 at 20:59, Yubiao Feng <yu...@streamnative.io.invalid> wrote:
> > > > > > >
> > > > > > > > Hi Qiang
> > > > > > > >
> > > > > > > > > 1. How do you handle the race condition when you are trying to
> > > > > > > > > unload the subscription, and the new consumer wants to subscribe to
> > > > > > > > > this subscription at the same time? I'm unsure if it has the race
> > > > > > > > > condition. I just want to remind you about that. :)
> > > > > > > >
> > > > > > > > The methods `addConsumer` and `removeConsumer` both hold
> > > > > > > > synchronized locks. If we also take a synchronized lock when
> > > > > > > > executing `reset subscription`, that solves the problem.
> > > > > > > >
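The locking argument above can be illustrated with a small toy model. This is only a sketch in Python, not Pulsar's broker code: `SubscriptionSketch` and its methods are hypothetical names, and a single `threading.Lock` stands in for the Java `synchronized` methods mentioned in the thread. It shows that when add, remove, and unload all take the same lock, a consumer that subscribes concurrently ends up either disconnected by the unload or attached afterwards, never in both places.

```python
import threading


class SubscriptionSketch:
    """Toy model of a subscription; not Pulsar's real class."""

    def __init__(self):
        # One lock shared by every mutating method, standing in for
        # Java's `synchronized` (an assumption made for illustration).
        self._lock = threading.Lock()
        self.consumers = set()

    def add_consumer(self, name):
        with self._lock:
            self.consumers.add(name)

    def remove_consumer(self, name):
        with self._lock:
            self.consumers.discard(name)

    def unload(self):
        """Disconnect every attached consumer atomically; return their names."""
        with self._lock:
            disconnected = set(self.consumers)
            self.consumers.clear()
            return disconnected


sub = SubscriptionSketch()
sub.add_consumer("c1")

# A new consumer subscribes while the unload is in progress.
joiner = threading.Thread(target=sub.add_consumer, args=("c2",))
joiner.start()
dropped = sub.unload()
joiner.join()

# Whichever call won the lock first, c2 is in exactly one of the two sets.
assert ("c2" in dropped) != ("c2" in sub.consumers)
```

In the real broker the disconnected consumers would reconnect on their own; the sketch only covers the mutual exclusion part of the answer.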
> > > > > > > > > 2. Would you like to add some restful API design to clarify the
> > > > > > > > > implementation?
> > > > > > > >
> > > > > > > > I have already added the REST API design to the proposal:
> > > > > > > > https://github.com/apache/pulsar/issues/19187
> > > > > > > >
> > > > > > > > On Thu, Jan 12, 2023 at 3:22 PM <ma...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Hi, Yubiao
> > > > > > > > >
> > > > > > > > > I agree with this idea because some users care about the production
> > > > > > > > > rate. They don't want to unload the whole topic to fix the
> > > > > > > > > subscription problem.
> > > > > > > > >
> > > > > > > > > I've got some questions:
> > > > > > > > >
> > > > > > > > > 1. How do you handle the race condition when you are trying to
> > > > > > > > > unload the subscription, and the new consumer wants to subscribe to
> > > > > > > > > this subscription at the same time? I'm unsure if it has the race
> > > > > > > > > condition. I just want to remind you about that. :)
> > > > > > > > > 2. Would you like to add some restful API design to clarify the
> > > > > > > > > implementation?
> > > > > > > > >     a. Request method
> > > > > > > > >     b. Request path
> > > > > > > > >     c. Response code
> > > > > > > > >     d. etc.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks for your work.
> > > > > > > > > Mattison
> > > > > > > > > On Jan 11, 2023, 17:01 +0800, Yubiao Feng <yubiao.feng@streamnative.io.invalid>, wrote:
> > > > > > > > > > Hi community
> > > > > > > > > >
> > > > > > > > > > I am starting a DISCUSS for PIP-240: A new API to unload subscriptions.
> > > > > > > > > >
> > > > > > > > > > PIP issue: https://github.com/apache/pulsar/issues/19187
> > > > > > > > > >
> > > > > > > > > > ### Motivation
> > > > > > > > > >
> > > > > > > > > > We sometimes unload a topic to resolve consumption-stop issues, but
> > > > > > > > > > unloading the topic also impacts the producer side.
> > > > > > > > > >
> > > > > > > > > > ### Goal
> > > > > > > > > >
> > > > > > > > > > Provide a new API to unload at the subscription level. It triggers the
> > > > > > > > > > reconnection of all consumers on that subscription, and the
> > > > > > > > > > reconnection is guaranteed by the client. The API will be used in
> > > > > > > > > > these ways:
> > > > > > > > > > - unload a specific subscription of one topic (or partitioned topic)
> > > > > > > > > > - unload all subscriptions of one topic (or partitioned topic)
> > > > > > > > > > - unload subscriptions of one topic (or partitioned topic) by regular
> > > > > > > > > >   expression
> > > > > > > > > > - If a reader's subscription name is not set, a random subscription
> > > > > > > > > >   name prefixed with 'multiTopicsReader-' or 'reader-' will be used,
> > > > > > > > > >   and users can unload these subscriptions using regular expressions.
> > > > > > > > > >
> > > > > > > > > > In addition to triggering consumer disconnection, Unloading Subscribers
> > > > > > > > > > will restart the Dispatcher, which resets the redelivery message queue
> > > > > > > > > > and the delayed message queue in the Broker's memory. This can help
> > > > > > > > > > resolve issues caused by an abnormal dispatcher state. However, the
> > > > > > > > > > execution flow of Unloading Subscribers does not include a restart of
> > > > > > > > > > the Managed Cursor related to this dispatcher; if there is a problem
> > > > > > > > > > with the cursor, we can only rely on unloading the topic to solve it.
> > > > > > > > > >
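To make the scope of the reset concrete, here is a toy model of which state an unload would and would not touch, as described in the paragraph above. It is a sketch under assumptions: `DispatcherSketch`, `SubscriptionState`, and `unload_subscription` are hypothetical names for illustration and do not mirror the broker's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class DispatcherSketch:
    # In-memory queues that a dispatcher restart would discard.
    redelivery_queue: list = field(default_factory=list)
    delayed_queue: list = field(default_factory=list)


@dataclass
class SubscriptionState:
    cursor_position: int = 0  # stands in for managed cursor state
    consumers: set = field(default_factory=set)
    dispatcher: DispatcherSketch = field(default_factory=DispatcherSketch)


def unload_subscription(state):
    """Drop consumers and replace the dispatcher; leave the cursor alone."""
    disconnected = set(state.consumers)
    state.consumers.clear()
    state.dispatcher = DispatcherSketch()  # fresh, empty in-memory queues
    return disconnected


state = SubscriptionState(cursor_position=42, consumers={"c1", "c2"})
state.dispatcher.redelivery_queue.append("msg-7")
state.dispatcher.delayed_queue.append("msg-9")

unload_subscription(state)

assert state.dispatcher.redelivery_queue == []
assert state.cursor_position == 42  # cursor problems would still need a topic unload
```

The point of the model is the asymmetry: dispatcher state is rebuilt, cursor state survives, which is why a faulty cursor is out of scope for this API.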
> > > > > > > > > > Note: From the client's perspective, the connection may be shared by
> > > > > > > > > > consumers, producers, and transactions, so Unloading Subscribers may
> > > > > > > > > > also impact producers and transactions.
> > > > > > > > > >
> > > > > > > > > > #### These scenarios are not supported
> > > > > > > > > > - The features `message-dedup`, `geo-replication`, and `shadow-topic`
> > > > > > > > > >   also read messages from the topic, but Unloading Subscribers will not
> > > > > > > > > >   support triggering restarts of these three features (because in these
> > > > > > > > > >   scenarios the cursor is used directly to read the data, not a
> > > > > > > > > >   consumer or reader).
> > > > > > > > > > - The Compaction task (subscription name `__compaction`) also uses a
> > > > > > > > > >   reader to read data, but Unloading Subscribers does not support it
> > > > > > > > > >   because this task creates a new reader each time it starts.
> > > > > > > > > > - None of the topics related to Transaction features are supported:
> > > > > > > > > >   - `__transaction_buffer_snapshot` works with the TB recover task, and
> > > > > > > > > >     this task creates a new reader each time it starts.
> > > > > > > > > >   - `__transaction_pending_ack` works with the Transaction Pending Ack
> > > > > > > > > >     Store replay task, and this task uses the managed cursor directly
> > > > > > > > > >     to read data.
> > > > > > > > > >   - `__transaction_log_xxx` works with the Transaction Log task, which
> > > > > > > > > >     uses the managed cursor directly to read data.
> > > > > > > > > >   - `transaction_coordinator_assign`: no data is written to this topic.
> > > > > > > > > >
> > > > > > > > > > #### Special system topic supports
> > > > > > > > > > The system topic `__change_events` is used to support topic-level
> > > > > > > > > > policies. There may also be message delivery issues in this scenario,
> > > > > > > > > > so Unloading Subscribers will support this topic.
> > > > > > > > > >
> > > > > > > > > > ### API Changes
> > > > > > > > > >
> > > > > > > > > > #### For persistent topic
> > > > > > > > > > ```
> > > > > > > > > > pulsar-admin persistent unload {topic_name} -s {sub_name}
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > #### For non-persistent topic
> > > > > > > > > > ```
> > > > > > > > > > pulsar-admin non-persistent unload {topic_name} -s {sub_name}
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > #### Explain the param `-s`
> > > > > > > > > > - set param `-s` to a specific subscription name to unload that
> > > > > > > > > >   subscription
> > > > > > > > > > - set param `-s` to `**` to unload all subscriptions under this topic
> > > > > > > > > > - set param `-s` to a regular expression to unload a batch of
> > > > > > > > > >   subscriptions under this topic
> > > > > > > > > >
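The three meanings of `-s` listed above could be resolved along these lines. This is a rough Python sketch, not the proposed implementation: `select_subscriptions` is a hypothetical helper, and the thread does not say how a literal name is told apart from a regular expression, so the sketch assumes every value other than `**` is a pattern that must match the whole subscription name.

```python
import re


def select_subscriptions(selector, subscriptions):
    """Return the subscription names that a given -s value would select.

    Assumptions (not stated in the proposal): '**' selects everything,
    and any other value is a regular expression that must match the
    full name, which also covers plain names without metacharacters.
    """
    if selector == "**":
        return list(subscriptions)
    pattern = re.compile(selector)
    return [name for name in subscriptions if pattern.fullmatch(name)]


# Example names; the reader-style prefixes come from the proposal text.
subs = ["orders-sub", "reader-1a2b3c", "multiTopicsReader-9f8e", "audit"]

one = select_subscriptions("orders-sub", subs)
everything = select_subscriptions("**", subs)
readers = select_subscriptions(r"(multiTopicsReader|reader)-.*", subs)
```

One consequence of this reading is that a subscription whose name contains regex metacharacters would need escaping, which may be worth settling in the REST API design.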
> > > > > > > > > >
> > > > > > > > > > Thanks
> > > > > > > > > > Yubiao Feng
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>