Posted to dev@pulsar.apache.org by Joe Francis <jo...@verizonmedia.com.INVALID> on 2021/01/23 00:59:47 UTC

Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Let me take a step back and explain how I am looking at this from a
high-level design viewpoint.


Bookkeeper (BK) is like an LSM implementation of a KV store. Writes to all
keys are appended to a single file; deletes are logical. Compaction
reclaims space. An index is used to locate entries, track logical deletes,
and reclaim space.


The index in BK is another LSM. Again, writes are appended, deletes are
logical, and an index is used to locate entries, account for deletes, and
drive compaction to reclaim space (the implementation within RocksDB is far
more complex, with bloom filters and memtables, but you get the idea). BK
just uses a sophisticated index (RocksDB) which is tiny and cacheable, and
RocksDB has within it a sophisticated index which is small and cacheable.
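
To make that analogy concrete, here is a toy sketch of an append-only KV log
with logical deletes and an index; it is purely illustrative and is not
BookKeeper or RocksDB code.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Toy append-only KV log with logical deletes, to illustrate the LSM analogy.
    public class ToyLsmLog {
        record Record(String key, String value, boolean tombstone) {}

        private final List<Record> log = new ArrayList<>();         // append-only segment
        private final Map<String, Integer> index = new HashMap<>(); // key -> latest offset

        public void put(String key, String value) {
            log.add(new Record(key, value, false));
            index.put(key, log.size() - 1);                          // index tracks the latest write
        }

        public void delete(String key) {
            log.add(new Record(key, null, true));                    // logical delete (tombstone)
            index.put(key, log.size() - 1);
        }

        public String get(String key) {
            Integer offset = index.get(key);
            if (offset == null) return null;
            Record r = log.get(offset);
            return r.tombstone() ? null : r.value();                 // index resolves deletes
        }

        // Compaction would rewrite the log keeping only the offsets still referenced
        // by the index, reclaiming space held by overwritten and deleted records.
    }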


So when I look at this proposal, what I see is the same - another attempt
to build an LSM with a sophisticated index/cache mechanism using
log-structured storage. So I am quite skeptical that this needs to be
solved this way, within Pulsar.



Joe

On Wed, Jan 20, 2021 at 12:30 AM linlin <li...@apache.org> wrote:

> We can look at ManagedCursorImpl.buildIndividualDeletedMessageRanges
>
> What is saved in the entry is not a BitSet but a series of MessageRanges,
> each of which contains information such as the ledgerId and entryId. The
> BitSet exists only in memory and is used to quickly determine whether a
> position has already been acknowledged.
> In addition, the position of each ack is stored in the
> individualDeletedMessages queue. When persisting to the entry, the queue
> is traversed and the position information of each ack generates a
> MessageRange.
> A MessageRange contains a lowerEndpoint (ledgerId + entryId) and an
> upperEndpoint (ledgerId + entryId): 4 longs, about 256 bits.
>
> Assume a more extreme scenario: 300K messages where every other message is
> unacknowledged, so 150K positions have to be stored in
> individualDeletedMessages. 150K * 256 / 8 / 1024 / 1024 ≈ 4.6MB.
> Of course, there are also scenarios where a customer's acks span several
> ledgers.
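
A quick back-of-envelope check of that estimate; the field layout mirrors the
MessageRange described above, but the class and constant names are
illustrative, not the actual Pulsar definitions.

    public class RangeSizeEstimate {
        static final int LONGS_PER_RANGE = 4;                          // 2 endpoints x (ledgerId, entryId)
        static final int BITS_PER_RANGE = LONGS_PER_RANGE * Long.SIZE; // 4 * 64 = 256 bits

        public static void main(String[] args) {
            long ranges = 150_000;   // 300K messages with every other one unacked
            double megabytes = ranges * BITS_PER_RANGE / 8.0 / 1024 / 1024;
            System.out.printf("%d ranges ≈ %.1f MB%n", ranges, megabytes); // ≈ 4.6 MB
        }
    }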
>
>
> On 2021/01/20 00:38:47, Joe F <j....@gmail.com> wrote:
> > I have a simpler question. Just storing the message-ids raw will fit ~300K
> > entries in one ledger entry. With the bitmap changes, we can store a
> > couple of million within one 5MB ledger entry. So can you tell us what
> > numbers of unacked messages are creating a problem? What exactly are the
> > issues you face, and at what numbers of unacked messages/memory use etc.?
> >
> > I have my own concerns about this proposal, but I would like to
> > understand the problem first.
> >
> > Joe
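
A rough capacity check behind those numbers; the 16-byte raw message-id size
and the one-bit-per-message bitmap are simplifying assumptions.

    public class LedgerEntryCapacity {
        public static void main(String[] args) {
            long entryBytes = 5L * 1024 * 1024;     // one 5MB ledger entry
            long bytesPerRawId = 16;                // ledgerId + entryId, two longs
            System.out.println("raw message-ids per entry ≈ " + entryBytes / bytesPerRawId);
            // ≈ 327K, in line with the ~300K figure above

            long bitsPerEntry = entryBytes * 8;     // one bit per tracked message
            System.out.println("bitmap positions per entry ≈ " + bitsPerEntry);
            // ≈ 41.9M bits, so millions of acknowledgment positions easily fit
        }
    }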
> >
> > On Sun, Jan 17, 2021 at 10:16 PM Sijie Guo <gu...@gmail.com> wrote:
> >
> > > Hi Lin,
> > >
> > > Thank you and Penghui for drafting this! We have seen a lot of pain
> > > points with `managedLedgerMaxUnackedRangesToPersist` when enabling
> > > delayed messages. Glad that you and Penghui are spending time on
> > > resolving this!
> > >
> > > Overall the proposal looks good. But I have a couple of questions about
> > > the proposal.
> > >
> > > 1. What happens if the broker fails to write the entry marker? For
> > > example, at t0, the broker flushes dirty pages and successfully writes
> > > an entry marker. At t1, the broker tries to flush dirty pages but fails
> > > to write the new entry marker. How can you recover the entry marker?
> > >
> > > 2. When a broker crashes and recovers the managed ledger, the cursor
> > > ledger is not writable anymore. Are you going to create a new cursor
> > > ledger and copy all the entries from the old cursor ledger to the new
> > > one?
> > >
> > > It would be good if you can clarify these two questions.
> > >
> > > - Sijie
> > >
> > > On Sun, Jan 17, 2021 at 9:48 PM linlin <li...@apache.org> wrote:
> > >
> > > > Hi, community:
> > > >     Recently we encountered some problems when using individual
> > > > acknowledgments, such as: when the amount of acknowledgment is large,
> > > > entry writing fails; a large amount of cache causes OOM, etc.
> > > > So I drafted a PIP at
> > > > https://docs.google.com/document/d/1uQtyb8t6X04v2vrSrdGWLFkuCkBcGYZbqK8XsVJ4qkU/edit?usp=sharing
> > > > Any feedback is welcome.

Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by PengHui Li <pe...@apache.org>.
Hi Rajan,

10M unack-message ranges can cover most use cases,
but there are still some scenarios that cannot be dealt with.
We can provide an abstraction for users who want to extend the implementation.
I think this will not conflict with your improvement.

This PIP is not just about improving the persistence of individual acks;
there are other problems the PR does not cover, such as limiting the memory
used by individual acks.
Every time the mark-delete position changes, we need to write a few MB of
data to the bookies; if you have 1000 topics in this situation, that is also
a point to worry about (a rough back-of-envelope appears below).
We do not expect one implementation to solve all of the users' problems.
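
A rough back-of-envelope for that mark-delete write amplification; the 4.6MB
per-flush figure comes from the estimate earlier in the thread, while the
topic count and flush rate are assumptions, not measurements.

    public class MarkDeleteWriteAmplification {
        public static void main(String[] args) {
            double perFlushMb = 4.6;      // ack-range snapshot written per mark-delete change
            int topics = 1000;            // topics/subscriptions in this state
            double flushesPerSecond = 1;  // assumed cursor-persist rate per subscription
            double aggregateMbPerSecond = perFlushMb * topics * flushesPerSecond;
            System.out.printf("≈ %.0f MB/s written to bookies just for cursor state%n",
                    aggregateMbPerSecond); // ≈ 4600 MB/s under these assumptions
        }
    }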

BTW, I don't think introducing an abstraction will bring complexity to
message acknowledgment.
It helps keep the responsibilities of the various managed-ledger components
clearly separated.

Thanks
Penghui



On Wed, Jan 27, 2021 at 12:00 PM Rajan Dhabalia <rd...@apache.org> wrote:

> I have created a PR which should allow brokers to store up to 10M
> unack-message ranges. I think it should be large enough for any use cases,
> and probably now we might not need to introduce an abstraction for ack
> management, avoiding any further complexity in the message acknowledgement
> path as well.
> https://github.com/apache/pulsar/pull/9292
>
> Thanks,
> Rajan
>
>
>
> On Sun, Jan 24, 2021 at 7:14 PM Sijie Guo <gu...@gmail.com> wrote:
>
> > Rajan,
> >
> > Thank you for sharing the prototype! That looks great to me.
> >
> > In order for the community to evolve and experiment with different
> > approaches, how about we abstract this acknowledgment management as an
> > interface?
> >
> > If this approach works for you, maybe Lin Lin can focus on abstracting this
> > interface. Then we can implement your approach and his approach using this
> > interface. Users can decide which implementation to use.
> >
> > Thanks,
> > Sijie
> >
> > On Sat, Jan 23, 2021 at 7:46 PM Rajan Dhabalia <rd...@apache.org>
> > wrote:
> >
> > > Hi,
> > >
> > > >> Delayed messages or certain user logic can introduce a lot of
> > > >> message-holes. We have seen this issue in quite a lot of customers'
> > > >> production environment.
> > >
> > > I agree that we need a larger buffer to store and recover individually
> > > deleted messages, and the existing 150K limit might not be enough for
> > > many use cases. However, I would also like to highlight issues that
> > > happened in the past due to keeping a large number of deleted messages
> > > in the broker’s memory, and one of them was high GC pauses.
> > > Therefore, we introduced ConcurrentOpenLongPairRangeSet
> > > <https://github.com/apache/pulsar/pull/3818> to manage deleted messages
> > > without actually storing range objects in memory. OpenRangeSet uses a
> > > bitset to store ranges in memory, and we can also utilize it to persist
> > > to disk for recovery. This approach has various advantages: a simple
> > > implementation, a large enough range for recovery, and it skips the
> > > intermediate conversion from unack-messages to the bitset in
> > > OpenRangeSet, which saves extra CPU during recovery.
> > >
> > > I implemented a simple prototype
> > > <https://github.com/rdhabalia/pulsar/commit/1f8e5e745e9f1d1429697b5dee1da70545385653>
> > > to store deleted messages using a bitset in OpenRangeSet, and we can
> > > persist 10M ranges with a 5MB data size, which I guess is large enough
> > > for any use cases. So, we can use this approach to solve the problem
> > > without introducing unnecessary complexity in the managed-cursor.
> > >
> > > Thanks,
> > > Rajan
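
To make the bitset idea concrete, a simplified stand-in might look like the
sketch below. The real Pulsar class is ConcurrentOpenLongPairRangeSet (PR
#3818); this version uses java.util.BitSet, ignores the concurrency handling
the real implementation provides, and only illustrates one bit per entry
keyed by ledgerId.

    import java.util.BitSet;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Simplified stand-in for bitset-based ack tracking, one BitSet per ledger.
    public class AckBitmapSketch {
        private final Map<Long, BitSet> ackedByLedger = new ConcurrentHashMap<>();

        public void ack(long ledgerId, int entryId) {   // real entry ids are longs; int keeps the sketch simple
            ackedByLedger.computeIfAbsent(ledgerId, id -> new BitSet()).set(entryId);
        }

        public boolean isAcked(long ledgerId, int entryId) {
            BitSet bits = ackedByLedger.get(ledgerId);
            return bits != null && bits.get(entryId);
        }

        // For persistence, each ledger's bitmap can be dumped as packed longs, which
        // is what keeps millions of tracked entries in the low-megabyte range.
        public long[] toLongArray(long ledgerId) {
            BitSet bits = ackedByLedger.get(ledgerId);
            return bits == null ? new long[0] : bits.toLongArray();
        }
    }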
> > >
> > > On Fri, Jan 22, 2021 at 7:52 PM Sijie Guo <gu...@gmail.com> wrote:
> > >
> > > > Joe - Delayed messages or certain user logic can introduce a lot of
> > > > message holes. We have seen this issue in quite a lot of customers'
> > > > production environments. Hence we need to find a solution for these
> > > > problems.
> > > > If you are skeptical of an implementation like that, how about making
> > > > the cursor implementation pluggable? We can make this proposal
> > > > implemented as one plugin. So it will not impact any existing logic,
> > > > but will allow people to use a plugin to solve this problem.
> > > >
> > > > Thanks,
> > > > Sijie
> > > >

Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Sijie Guo <gu...@gmail.com>.
Matteo, Rajan, Joe, Addison, Penghui, and Jia met today to discuss
this issue. We have come up with the next steps:

In general, we have agreed to proceed with the PIP.

1. Penghui is going to update the PIP to clarify terminology and diagrams
in the PIP. He will also write more details about how the LRU cache works.
2. Matteo suggested adding the entry index to the existing PositionInfo. If
the number of unacked ranges is lower than a certain threshold, use the
existing approach, and only move to the split approach after it goes beyond
the threshold (a rough sketch of this switch appears after this list).
3. We agreed to make sure we still have a cap on the max unacked ranges.
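
A minimal sketch of the threshold-based switch from point 2; the class and
method names are hypothetical, not the actual managed-cursor code.

    // Hypothetical sketch: persist ranges inline while small, switch to the
    // split/entry-marker layout once the count crosses a threshold.
    public class CursorPersistenceSwitch {
        private final int splitThreshold;   // e.g. managedLedgerMaxUnackedRangesToPersist

        public CursorPersistenceSwitch(int splitThreshold) {
            this.splitThreshold = splitThreshold;
        }

        public void persist(int unackedRangeCount) {
            if (unackedRangeCount <= splitThreshold) {
                persistInlineWithPositionInfo();   // existing single-entry approach
            } else {
                persistAsSplitEntries();           // PIP-78 multi-entry approach
            }
        }

        private void persistInlineWithPositionInfo() { /* write ranges into PositionInfo */ }

        private void persistAsSplitEntries() { /* write ranges across multiple entries plus a marker */ }
    }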

If I missed anything, feel free to add it here.

The discussion recording is also available at
https://github.com/apache/pulsar/wiki/Community-Meetings#pip-discussions

Thanks,
Sijie



Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Sijie Guo <gu...@gmail.com>.
@Rajan Dhabalia <rd...@apache.org>

We discussed this PIP in the community meeting and agreed that we need to
evaluate it one more time to see if we can combine it with the bitmap
proposal as the final solution.

We should think about more use cases before pushing the bitmap approach.
Hence let's spend more time reviewing Lin's proposal and see if we can find
an approach there.

If there is no way to find a solution that solves all the problems, I would
still suggest having an interface that allows implementing different
algorithms. We can always have a basic algorithm that works for most of the
use cases, but also allow advanced users to customize a solution that works
for the remaining use cases.

https://docs.google.com/document/d/19dXkVXeU2q_nHmkG8zURjKnYlvD96TbKf5KjYyASsOE/edit#bookmark=id.ekbmsiqz25jm

Thanks,
Sijie


Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Sijie Guo <gu...@gmail.com>.
Rajan - I understand your concern about memory. Lin, Penghui, and I also
acknowledged that none of the implementations solves every use case. Each
implementation has its limitations and concerns.

I am trying to find a way in which both parties (you and Lin) can explore
different implementations. An abstraction/interface sounds like a
reasonable approach to take, no? We have a similar situation in the delayed
message scheduler. The current queue-based implementation works for certain
workloads but doesn't work for delayed messages spanning a long time span.
But the delayed message scheduler is an interface that allows people to
implement a better solution (e.g. a hashed-wheel-based implementation).
Does that make sense?
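
In that spirit, a hypothetical shape for the pluggable acknowledgment-state
abstraction discussed in this thread could look like the sketch below; the
interface and method names are illustrative, not an existing Pulsar API.

    import java.util.concurrent.CompletableFuture;

    // Hypothetical abstraction sketch; names are illustrative, not actual Pulsar APIs.
    public interface AckStateStore {
        void addAcknowledgment(long ledgerId, long entryId);   // record an individual ack
        boolean isAcknowledged(long ledgerId, long entryId);   // query the current state
        CompletableFuture<Void> persist();                     // flush state to the cursor ledger
        CompletableFuture<Void> recover();                     // rebuild state after failover
    }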

Thanks,
Sijie



Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Rajan Dhabalia <rd...@apache.org>.
My only point was: if the broker tries to manage more than 10M unack
message ranges (per subscription), then the broker faces many negative
consequences in terms of memory and CPU, and it will then have to take on
additional debt to solve such problems. Therefore, I consider 10M unack
message ranges to be more than enough for any application; if not, we
should prevent those applications from going beyond such thresholds to
handle the back-pressure of unacked messages, rather than allowing
applications to build up a cache on the broker side.

We have seen multiple issues in the past due to large numbers of unacked
messages in memory, and to handle such abuse, brokers have configurations
that cap the max unacked messages per consumer (maxUnackedMessagesPerConsumer:
default 50K) and per subscription (maxUnackedMessagesPerSubscription: default
200K). So brokers already have large enough limits for storing unacked
messages; going beyond those limits causes scaling issues in brokers, so it
is better to keep the system stable and safe.
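
For reference, the two existing broker.conf knobs mentioned above, shown at
the defaults quoted in this thread (a sketch of the relevant settings only,
not a complete configuration):

    # Back-pressure limits on unacknowledged messages (defaults as quoted above)
    maxUnackedMessagesPerConsumer=50000
    maxUnackedMessagesPerSubscription=200000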

Thanks,
Rajan


Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Sijie Guo <gu...@gmail.com>.
Agreed with Lin. I think we should try to abstract this into an interface
and allow different implementations.

Rajan - what is your real concern with making it abstract?

- Sijie


Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Lin Lin <li...@apache.org>.
Hi Rajan,
Thank you for your PR.
The main difference lies in whether 10MB is enough, and in the memory-doubling
problem; both come down to different business scenarios.
In some business scenarios, a QPS of 20k/s is considered very low, and
workloads exceeding this order of magnitude are common.
If the limit is only increased to 10MB, the time before exceeding the
threshold only changes from about 30 seconds to 60 seconds, and the problems
described in the PIP are still not solved.
"Large enough" may hold for your scenario, but in some scenarios it is not
enough in most cases...
Because the problem has not been solved, I suggest introducing an abstraction
so that different people can choose.
Your PR is an improvement to the current performance; there is no conflict
between the two.
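
As a rough illustration of that timing argument, the sketch below shows the
shape of the calculation; the hole-creation rate is an assumption, not Lin
Lin's actual workload.

    // Time until a persisted-range budget is exhausted, under assumed rates.
    public class RangeBudgetExhaustion {
        public static void main(String[] args) {
            long bytesPerRange = 32;        // 4 longs per MessageRange
            double holesPerSecond = 5_000;  // assumption: ~5K new holes/s out of a 20k/s stream
            for (long capMb : new long[] {5, 10}) {
                double ranges = capMb * 1024.0 * 1024 / bytesPerRange;
                double seconds = ranges / holesPerSecond;
                System.out.printf("%d MB cap ≈ %.0f ranges ≈ %.0f s at this rate%n",
                        capMb, ranges, seconds);
            }
            // Doubling the cap only doubles the time to exhaustion, which is the
            // point above about going from roughly 30 seconds to 60 seconds.
        }
    }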

Thanks



Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Rajan Dhabalia <rd...@apache.org>.
I have created a PR which should allow brokers to store up to 10M
unack-message ranges. I think it should be large enough for any use cases,
and probably now we might not need to introduce an abstraction for ack
management, avoiding any further complexity in the message acknowledgement
path as well.
https://github.com/apache/pulsar/pull/9292

Thanks,
Rajan




Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by PengHui Li <pe...@apache.org>.
Thanks for all your suggestions. LinLin and I will work on abstracting this
acknowledgment management as an interface.

- Penghui


Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Sijie Guo <gu...@gmail.com>.
Rajan,

Thank you for sharing the prototype! That looks great to me.

In order for the community to evolve and experiment with different
approaches, how about we abstract this acknowledgment management as an
interface?

If this approach works for you, maybe Lin Lin can focus on abstracting this
interface. Then we can implement your approach and his approach using this
interface. Users can decide which implementation to use.

Thanks,
Sijie

On Sat, Jan 23, 2021 at 7:46 PM Rajan Dhabalia <rd...@apache.org> wrote:

> Hi,
>
> >> Delayed messages or certain user logic can introduce a lot of
> >> message-holes. We have seen this issue in quite a lot of customers'
> >> production environment.
>
> I agree that we need a larger buffer to store and recover individually
> deleted messages, and the existing 150K limit might not be enough for many
> use cases. However, I would also like to highlight issues that happened in
> the past due to keeping a large number of deleted messages in the broker's
> memory, one of them being high GC pauses. Therefore, we introduced
> ConcurrentOpenLongPairRangeSet <https://github.com/apache/pulsar/pull/3818>
> to manage deleted messages without actually storing range objects in
> memory. OpenRangeSet uses a bitset to store ranges in memory, and we can
> also use it to persist to disk for recovery. This approach has several
> advantages: a simple implementation, a large enough range for recovery, and
> it skips the intermediate conversion from unacked messages to a bitset in
> OpenRangeSet, which saves extra CPU during recovery.
>
> I implemented a simple prototype
> <https://github.com/rdhabalia/pulsar/commit/1f8e5e745e9f1d1429697b5dee1da70545385653>
> to store deleted messages using a bitset in OpenRangeSet, and we can
> persist 10M ranges with a 5MB data size, which I guess is large enough for
> any use case. So we can use this approach to solve the problem without
> introducing unnecessary complexity in the managed cursor.
>
> Thanks,
> Rajan
>
> On Fri, Jan 22, 2021 at 7:52 PM Sijie Guo <gu...@gmail.com> wrote:
>
> > Joe - Delayed messages or certain user logic can introduce a lot of
> > message holes. We have seen this issue in quite a lot of customers'
> > production environment. Hence we need to find a solution to these
> > problems. If you are skeptical of an implementation like that, how about
> > we make the cursor implementation pluggable? We can implement this
> > proposal as one plugin, so it will not impact any existing logic but will
> > allow people to use a plugin to solve this problem.
> >
> > Thanks,
> > Sijie

Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Rajan Dhabalia <rd...@apache.org>.
Hi,

>> Delayed messages or certain user logic can introduce a lot of
>> message-holes. We have seen this issue in quite a lot of customers'
>> production environment.

I agree that we need a larger buffer to store and recover individually
deleted messages, and the existing 150K limit might not be enough for many
use cases. However, I would also like to highlight issues that happened in
the past due to keeping a large number of deleted messages in the broker's
memory, one of them being high GC pauses. Therefore, we introduced
ConcurrentOpenLongPairRangeSet <https://github.com/apache/pulsar/pull/3818>
to manage deleted messages without actually storing range objects in
memory. OpenRangeSet uses a bitset to store ranges in memory, and we can
also use it to persist to disk for recovery. This approach has several
advantages: a simple implementation, a large enough range for recovery, and
it skips the intermediate conversion from unacked messages to a bitset in
OpenRangeSet, which saves extra CPU during recovery.

I implemented a simple prototype
<https://github.com/rdhabalia/pulsar/commit/1f8e5e745e9f1d1429697b5dee1da70545385653>
to store deleted messages using a bitset in OpenRangeSet, and we can
persist 10M ranges with a 5MB data size, which I guess is large enough for
any use case. So we can use this approach to solve the problem without
introducing unnecessary complexity in the managed cursor.

Thanks,
Rajan
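
To make the bitset idea concrete, here is a minimal sketch of persisting
acknowledgment state as a bitset. It uses plain java.util.BitSet rather than
Pulsar's ConcurrentOpenLongPairRangeSet, so the class and the numbers in the
comments are only indicative:

import java.util.BitSet;

// Minimal sketch only: one bit per entry of a ledger, set when the entry is
// individually acknowledged. 10 million entries take roughly
// 10_000_000 / 8 = 1.25MB uncompressed, so a few MB of cursor-ledger entry
// leaves plenty of headroom for per-ledger bookkeeping.
public class BitSetAckSnapshot {

    private final BitSet ackedEntries = new BitSet();

    public void ack(long entryId) {
        // Assumes entryId fits in an int for the purpose of this sketch.
        ackedEntries.set((int) entryId);
    }

    public boolean isAcked(long entryId) {
        return ackedEntries.get((int) entryId);
    }

    // Serialize the bitset so it can be written into a cursor ledger entry.
    public byte[] toBytes() {
        return ackedEntries.toByteArray();
    }

    // Rebuild the bitset from a previously persisted snapshot.
    public static BitSetAckSnapshot fromBytes(byte[] data) {
        BitSetAckSnapshot snapshot = new BitSetAckSnapshot();
        snapshot.ackedEntries.or(BitSet.valueOf(data));
        return snapshot;
    }
}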

On Fri, Jan 22, 2021 at 7:52 PM Sijie Guo <gu...@gmail.com> wrote:

> Joe - Delayed messages or certain user logic can introduce a lot of message
> holes. We have seen this issue in quite a lot of customers' production
> environment. Hence we need to find a solution to these problems. If you are
> skeptical of an implementation like that, how about we make the cursor
> implementation pluggable? We can implement this proposal as one plugin, so
> it will not impact any existing logic but will allow people to use a plugin
> to solve this problem.
>
> Thanks,
> Sijie

Re: [E] Re: [PIP-78] Split the individual acknowledgments into multiple entries

Posted by Sijie Guo <gu...@gmail.com>.
Joe - Delayed messages or certain user logic can introduce a lot of message
holes. We have seen this issue in quite a lot of customers' production
environment. Hence we need to find a solution to these problems. If you are
skeptical of an implementation like that, how about we make the cursor
implementation pluggable? We can implement this proposal as one plugin, so
it will not impact any existing logic but will allow people to use a plugin
to solve this problem.
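
Purely as a hypothetical sketch of what that pluggability could look like
(none of the names below exist in Pulsar today; they only illustrate the
shape of the idea):

// Hypothetical sketch only: how a broker could load an alternative
// cursor/acknowledgment backend without touching the default code path.
public interface ManagedCursorBackendFactory {

    // Name used to select this backend, e.g. from broker configuration.
    String name();

    // Create the acknowledgment-state store for one topic/subscription pair.
    AckStateStore create(String topic, String subscription);
}

// Minimal contract the rest of the broker would program against.
interface AckStateStore {
    void markAcked(long ledgerId, long entryId);
    boolean isAcked(long ledgerId, long entryId);
}

The default implementation would keep today's behavior, and the multi-entry
proposal would then ship as an alternative backend that operators opt into
through broker configuration.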

Thanks,
Sijie

On Fri, Jan 22, 2021 at 5:00 PM Joe Francis <jo...@verizonmedia.com.invalid>
wrote:

> Let me take a step back and explain how I am looking at this from a
> high-level design viewpoint.
>
>
> Bookkeeper (BK) is like an LSM implementation of a KV store. Writes to all
> keys are appended to a single file; deletes are logical. Compaction
> reclaims space. An index is used to locate entries, track logical deletes,
> and reclaim space.
>
>
> The index in BK is another LSM. Again, writes are appended, deletes are
> logical, and an index is used to locate entries, account for deletes, and
> compaction reclaims space (the implementation within rocksdb is far more
> complex, with bloom filters and memtables, but you get the idea). BK just
> uses a sophisticated index (rocksdb) which is tiny and cacheable, and
> rocksdb has within it a sophisticated index which is small and cacheable.
>
>
> So when I look at this proposal, what I see is the same - another attempt
> to build an LSM with a sophisticated index/cache mechanism using log
> structured storage. So I am quite skeptical that this needs to be solved
> this way, within Pulsar.
>
>
>
> Joe
>
> On Wed, Jan 20, 2021 at 12:30 AM linlin <li...@apache.org> wrote:
>
> > We can look at ManagedCursorImpl.buildIndividualDeletedMessageRanges.
> >
> > What is saved in the entry is not a BitSet but individual messageRanges,
> > each of which contains information such as ledgerId and entryId. The
> > BitSet only exists in memory and is used to quickly determine whether a
> > position already exists. In addition, the position of each ack is stored
> > in the individualDeletedMessages queue. When persisting to the entry, the
> > queue is traversed and the position information of each ack generates a
> > messageRange.
> > A messageRange contains lowerEndpoint (ledgerId+entryId) and upperEndpoint
> > (ledgerId+entryId): 4 longs, about 256 bits.
> >
> > Assume a more extreme scenario: 300K messages where every other message is
> > unacknowledged, so 150K positions are stored in individualDeletedMessages.
> > 150K * 256 / 8 / 1024 / 1024 ≈ 4.6MB
> > Of course, there are also scenarios where the customer's acks span several
> > ledgers.
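
(As a quick cross-check of the arithmetic quoted above, a small sketch: it
assumes the 32 bytes per persisted range implied by the four longs, and, for
comparison, one bit per message for a bitset encoding.)

public class AckStateSizeEstimate {
    public static void main(String[] args) {
        long ranges = 150_000;       // 300K messages, every other one unacked
        long bytesPerRange = 4 * 8;  // lower + upper endpoint, each a (ledgerId, entryId) pair = 4 longs
        double rangeMb = ranges * bytesPerRange / 1024.0 / 1024.0;
        System.out.printf("messageRange encoding: %.1f MB%n", rangeMb);  // ~4.6 MB

        long messages = 300_000;
        double bitsetKb = messages / 8.0 / 1024.0;
        System.out.printf("bitset encoding: %.1f KB%n", bitsetKb);       // ~36.6 KB, one bit per message
    }
}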
> >
> >
> > On 2021/01/20 00:38:47, Joe F <j....@gmail.com> wrote:
> > > I have a simpler question. Just storing the message-ids raw will fit
> > > ~300K entries in one ledger entry. With the bitmap changes, we can store
> > > a couple of million within one 5MB ledger entry. So can you tell us what
> > > numbers of unacked messages are creating a problem? What exactly are the
> > > issues you face, and at what numbers of unacked messages/memory use etc?
> > >
> > > I have my own concerns about this proposal, but I would like to
> > > understand the problem first.
> > >
> > > Joe
> > >
> > > On Sun, Jan 17, 2021 at 10:16 PM Sijie Guo <gu...@gmail.com> wrote:
> > >
> > > > Hi Lin,
> > > >
> > > > Thank you and Penghui for drafting this! We have seen a lot of pain
> > > > points with `managedLedgerMaxUnackedRangesToPersist` when enabling
> > > > delayed messages. Glad that you and Penghui are spending time on
> > > > resolving this!
> > > >
> > > > Overall the proposal looks good, but I have a couple of questions
> > > > about it.
> > > >
> > > > 1. What happens if the broker fails to write the entry marker? For
> > > > example, at t0, the broker flushes dirty pages and successfully writes
> > > > an entry marker. At t1, the broker tries to flush dirty pages but
> > > > fails to write the new entry marker. How can you recover the entry
> > > > marker?
> > > >
> > > > 2. When a broker crashes and recovers the managed ledger, the cursor
> > > > ledger is not writable anymore. Are you going to create a new cursor
> > > > ledger and copy all the entries from the old cursor ledger to the new
> > > > one?
> > > >
> > > > It would be good if you can clarify these two questions.
> > > >
> > > > - Sijie
> > > >
> > > > On Sun, Jan 17, 2021 at 9:48 PM linlin <li...@apache.org> wrote:
> > > >
> > > > > Hi, community:
> > > > >     Recently we encountered some problems when using individual
> > > > > acknowledgments, such as: when the amount of acknowledgment is
> > > > > large, entry writing fails; a large amount of cache causes OOM, etc.
> > > > > So I drafted a PIP in
> > > > > https://docs.google.com/document/d/1uQtyb8t6X04v2vrSrdGWLFkuCkBcGYZbqK8XsVJ4qkU/edit?usp=sharing
> > > > > any voice is welcomed.