Posted to dev@pulsar.apache.org by 太上玄元道君 <da...@apache.org> on 2024/03/19 11:35:52 UTC

[VOTE] PIP-345: Optimize finding message by timestamp

Hi Pulsar community,

This thread is to start a vote for PIP-345: Optimize finding message by
timestamp

PIP: https://github.com/apache/pulsar/pull/22234
Discuss thread:
https://lists.apache.org/thread/5owc9os6wmy52zxbv07qo2jrfjm17hd2

Thanks,
Tao Jiuming

Re: [VOTE] PIP-345: Optimize finding message by timestamp

Posted by PengHui Li <pe...@apache.org>.
Hi, Jiuming

Yes, "ManagedLedger#getEarliestMessagePublishTimeInBacklog" is not a good
one, and it should be the only method in ManagedLedger that has a publish
time concept.
I think we mixed the concepts in https://github.com/apache/pulsar/pull/12523,
which is bad.
It would be better to start a proposal to deprecate this method and change
the existing implementation.

> To find messages by timestamp, we can introduce a `sparse index` to
> Pulsar: after an add-entry operation completes, add an index entry to
> `ManagedLedgerIndex` and store the index in ML. What do you think?

Yes, we can have different options. If users do not keep too much data in
one ledger (the ledger size is configurable), it should be fine to just
build the index based on the ledger's timestamp (the ledger close time). By
default, that should be good enough for many use cases.
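
As a rough illustration of that default (a sketch only: `LedgerCloseTimeIndex`
and its methods are hypothetical names, not existing Pulsar APIs, and it
assumes publish times roughly track the broker-side ledger close times), a
ledger-level index could look like this:

```java
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;

// Hypothetical helper, not part of Pulsar: remembers when each ledger was
// closed so a timestamp lookup can skip whole ledgers.
final class LedgerCloseTimeIndex {
    // close time (epoch millis) -> ledger id, sorted by close time
    private final TreeMap<Long, Long> closeTimeToLedgerId = new TreeMap<>();

    // Record a ledger once it is closed. If two ledgers share the same
    // close time, the earlier one is kept so a forward scan misses nothing.
    void onLedgerClosed(long ledgerId, long closeTimeMillis) {
        closeTimeToLedgerId.putIfAbsent(closeTimeMillis, ledgerId);
    }

    // First closed ledger whose close time is >= the timestamp, i.e. the
    // earliest closed ledger that can still hold an entry written at that
    // time. Empty means the caller should look at the open ledger instead.
    OptionalLong findCandidateLedger(long timestampMillis) {
        Map.Entry<Long, Long> e = closeTimeToLedgerId.ceilingEntry(timestampMillis);
        return e == null ? OptionalLong.empty() : OptionalLong.of(e.getValue());
    }
}
```

The lookup only narrows the search to one ledger; finding the exact entry
inside that ledger would still need a scan or a finer-grained index.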

Since we would have the ManagedLedgerIndex abstraction, users can also
develop their own implementations for extreme performance requirements. We
should just keep the Pulsar core clear and simple, and make it work for the
most common cases.

Regards,
Penghui


On Mon, Mar 25, 2024 at 5:47 PM 太上玄元道君 <da...@apache.org> wrote:

> Hi Penghui,
>
> Thanks for your feedback!
>
> I'm not sure about this either, since publishTimestamp is a
> messaging-layer concept, and ML, as the persistence layer, should not be
> aware of it.
>
> But in ML, I noticed some methods that search for messages by publish
> timestamp (say, ManagedLedgerImpl#getEarliestMessagePublishTimeInBacklog),
> so that's why I wanted to add publishTimestamp to ML.
>
> Introducing a secondary index to ML is a good idea, since RocketMQ has a
> `Hash index` and Kafka has a `Sparse index`.
>
> To find messages by timestamp, we can introduce a `sparse index` to
> Pulsar: after an add-entry operation completes, add an index entry to
> `ManagedLedgerIndex` and store the index in ML. What do you think?
>
> Thanks,
> Tao Jiuming

Re: [VOTE] PIP-345: Optimize finding message by timestamp

Posted by 太上玄元道君 <da...@apache.org>.
Hi Penghui,

Thanks for your feedback!

I'm not sure about this either, since publishTimestamp is a messaging-layer
concept, and ML, as the persistence layer, should not be aware of it.

But in ML, I noticed some methods that search for messages by publish
timestamp (say, ManagedLedgerImpl#getEarliestMessagePublishTimeInBacklog),
so that's why I wanted to add publishTimestamp to ML.

Introducing a secondary index to ML is a good idea, since RocketMQ has a
`Hash index` and Kafka has a `Sparse index`.

To find messages by timestamp, we can introduce a `sparse index` to
Pulsar: after an add-entry operation completes, add an index entry to
`ManagedLedgerIndex` and store the index in ML. What do you think?
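
To make the idea a bit more concrete, here is a minimal sketch of a sparse
timestamp index. Everything in it is an assumption for illustration:
`SparseTimestampIndex`, `onEntryAdded` and `scanStartFor` are hypothetical
names, the index lives in memory only, and timestamps are assumed to be
non-decreasing in entry order.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Pulsar code: index only one entry out of every
// `interval` entries, then scan forward from the nearest indexed entry.
final class SparseTimestampIndex {
    private final int interval;
    // publish time (epoch millis) -> entry id of the first indexed entry
    private final TreeMap<Long, Long> timestampToEntryId = new TreeMap<>();
    private long entriesSeen = 0;

    SparseTimestampIndex(int interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
    }

    // Call after an add-entry operation completes.
    void onEntryAdded(long entryId, long publishTimeMillis) {
        if (entriesSeen++ % interval == 0) {
            // Keep the earliest entry for a given timestamp.
            timestampToEntryId.putIfAbsent(publishTimeMillis, entryId);
        }
    }

    // Entry id from which a forward scan for "first entry with publish
    // time >= target" can start. Uses the indexed entry strictly before
    // the target so equal timestamps are not skipped; falls back to entry
    // 0 (assumed to be the first entry id) when nothing earlier is indexed.
    long scanStartFor(long targetMillis) {
        Map.Entry<Long, Long> e = timestampToEntryId.lowerEntry(targetMillis);
        return e == null ? 0L : e.getValue();
    }
}
```

With an interval of N, a lookup costs one tree search plus a scan of at
most about N entries, and the index holds roughly 1/N of the entries.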

Thanks,
Tao Jiuming



On Mon, Mar 25, 2024 at 3:17 PM PengHui Li <pe...@apache.org> wrote:

> Hi, Jiuming
>
> I'm sorry for not getting back to you sooner.
>
> First, I support the motivation to optimize this case because it could be
> a significant blocker for users who want infinite data retention, which
> is a BIG differentiator from Apache Kafka. I have really seen cases with
> high publish throughput where one ledger could even hold 1M entries, with
> 100M new entries published to a topic.
>
> Then, I checked the details of the existing implementation. I think the
> tricky part is that publish time is not a ManagedLedger concept. I saw
> that the changes you proposed will add publish time to the ManagedLedger
> module, which doesn't look good to me, because it will couple the Pulsar
> concept with the ManagedLedger concept.
>
> Essentially, the publish time could be a secondary index of the
> ManagedLedger. My opinion is to have a general ManagedLedgerIndex
> abstraction, so that the Pulsar broker can create any index it wants.
> Since the broker creates the index, the broker can control the index's
> behavior. Then, the ManagedLedger can provide an API to search for an
> entry with a ManagedLedgerIndex. With this option, we don't need to add
> the publish time concept to ManagedLedger directly.
>
> In this case, if the broker tries to search for an entry with a predicate
> and an index, the managed ledger will search the index first. Of course,
> if the relevant entry cannot be found in the index, it just falls back to
> the "optimized full scan".
>
> Regards,
> Penghui

Re: [VOTE] PIP-345: Optimize finding message by timestamp

Posted by PengHui Li <pe...@apache.org>.
Hi, Jiuming

I'm sorry for not getting back to you sooner.

First, I support the motivation to optimize this case because it could be a
significant blocker for users who want infinite data retention, which is a
BIG differentiator from Apache Kafka. I have really seen cases with high
publish throughput where one ledger could even hold 1M entries, with 100M
new entries published to a topic.

Then, I checked the details of the existing implementation. I think the
tricky part is that publish time is not a ManagedLedger concept. I saw that
the changes you proposed will add publish time to the ManagedLedger module,
which doesn't look good to me, because it will couple the Pulsar concept
with the ManagedLedger concept.

Essentially, the publish time could be a secondary index of the
ManagedLedger. My opinion is to have a general ManagedLedgerIndex
abstraction, so that the Pulsar broker can create any index it wants. Since
the broker creates the index, the broker can control the index's behavior.
Then, the ManagedLedger can provide an API to search for an entry with a
ManagedLedgerIndex. With this option, we don't need to add the publish time
concept to ManagedLedger directly.

In this case, if the broker tries to search for an entry with a predicate
and an index, the managed ledger will search the index first. Of course, if
the relevant entry cannot be found in the index, it just falls back to the
"optimized full scan".
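
A rough sketch of that flow, to show the shape of the API rather than a
real design: the `ManagedLedgerIndex` signature below, `IndexedSearch`, and
the use of a plain `List` as the "ledger" are all assumptions made for
illustration, and the fallback here is a simple linear scan standing in
for the "optimized full scan".

```java
import java.util.List;
import java.util.OptionalInt;
import java.util.function.Predicate;

// Hypothetical abstraction: the broker supplies the index, and the managed
// ledger only asks it where a search may start.
interface ManagedLedgerIndex {
    // Position at or before the first matching entry, or empty if the
    // index has no answer for this key.
    OptionalInt lookup(long key);
}

// Hypothetical search helper that knows nothing about publish time.
final class IndexedSearch {
    private IndexedSearch() {}

    // Position of the first entry accepted by the predicate, trying the
    // index first and falling back to a scan from the beginning.
    static <E> OptionalInt findFirst(List<E> entries,
                                     Predicate<? super E> matches,
                                     ManagedLedgerIndex index,
                                     long key) {
        OptionalInt hint = index.lookup(key);
        if (hint.isPresent()) {
            OptionalInt found = scanFrom(entries, matches, hint.getAsInt());
            if (found.isPresent()) {
                return found;
            }
        }
        // Index miss, or nothing found from the hint: scan everything.
        return scanFrom(entries, matches, 0);
    }

    private static <E> OptionalInt scanFrom(List<E> entries,
                                            Predicate<? super E> matches,
                                            int start) {
        for (int i = Math.max(0, start); i < entries.size(); i++) {
            if (matches.test(entries.get(i))) {
                return OptionalInt.of(i);
            }
        }
        return OptionalInt.empty();
    }
}
```

In this sketch the predicate and the key both come from the caller, so the
search helper itself stays free of any publish time concept; correctness
depends on the index never returning a hint past the first match.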

Regards,
Penghui


On Mon, Mar 25, 2024 at 11:51 AM 太上玄元道君 <da...@apache.org> wrote:

> bump

Re: [VOTE] PIP-345: Optimize finding message by timestamp

Posted by 太上玄元道君 <da...@apache.org>.
bump

On Wed, Mar 20, 2024 at 4:23 PM 太上玄元道君 <da...@apache.org> wrote:

> bump

Re: [VOTE] PIP-345: Optimize finding message by timestamp

Posted by 太上玄元道君 <da...@apache.org>.
bump

On Tue, Mar 19, 2024 at 7:35 PM 太上玄元道君 <da...@apache.org> wrote:

> Hi Pulsar community,
>
> This thread is to start a vote for PIP-345: Optimize finding message by
> timestamp
>
> PIP: https://github.com/apache/pulsar/pull/22234
> Discuss thread:
> https://lists.apache.org/thread/5owc9os6wmy52zxbv07qo2jrfjm17hd2
>
> Thanks,
> Tao Jiuming
>