You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pulsar.apache.org by Yunze Xu <yz...@streamnative.io.INVALID> on 2022/11/08 10:24:49 UTC

[DISCUSSION] Redesign the MessageId interface

Hi all,

Currently we have the following 5 implementations of MessageId:

- MessageIdImpl: (ledger id, entry id, partition index)
  - BatchMessageIdImpl: adds (batch index, batch size, acker), where
    acker is a wrapper of a BitSet.
  - ChunkMessageIdImpl: adds another MessageIdImpl that represents
    the first MessageIdImpl of a BitSet.
  - MultiMessageIdImpl: adds a map that maps the topic name to the
    MessageId.
- TopicMessageIdImpl: adds the topic name and the partition name

These implementations are such a mess. For example, when users get a
MessageId from `Producer#send`:

```java
var id = producer.send("msg");
```

There is no getter to get some specific fields like ledger id. You can
only see a representation from `toString` method and got some output
like "0:0:-1:0". Maybe this design is to hidden some details, but if
users don't know the details like ledger id and entry id, how could
you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
implementation changed? Should it be treated as a breaking change?

The original definition of the underlying MessageIdData is much more
clear:

```proto
message MessageIdData {
    required uint64 ledgerId = 1;
    required uint64 entryId  = 2;
    optional int32 partition = 3 [default = -1];
    optional int32 batch_index = 4 [default = -1];
    repeated int64 ack_set = 5;
    optional int32 batch_size = 6;

    // For the chunk message id, we need to specify the first chunk message id.
    optional MessageIdData first_chunk_message_id = 7;
}
```

IMO, MessageId should be a wrapper of MessageIdData. It's more natural
to have an interface like:

```java
interface MessageId {
    long ledgerId();
    long entryId();
    Optional<Integer> partition();
    Optional<Integer> batchIndex();
    // ...
```

Additionally, there are many places that use only the triple of
(ledger id, entry id, batch index) as the key to represent the position.
Currently, they are done by adding a conversion from
BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
write something like:

```java
class MessageIdPosition implements Comparable<MessageIdPosition> {
    private final MessageId messageId;
    // TODO: compare only the triple (ledger, entry, batch)
```

Therefore, I'm going to write a proposal to redesign the MessageId
interface only by adding some getters. Regarding the 5 existing
implementations, I think we can drop them because they are a part
of `pulsar-client`, not `pulsar-client-api`.

Please feel free to share your points.

Thanks,
Yunze

Re: [DISCUSSION] Redesign the MessageId interface

Posted by PengHui Li <pe...@apache.org>.
Great discussion!
I agree with the conclusions!

IMO, we should not encourage users to touch the different implementations
of the MessageId.
Instead, we should provide a `MesssageIdUtils` in the client library(not
the interface)
with annotations `@InterfaceAudience.Public @InterfaceStability.Stable`. It
is just for
the specific purpose that users have to access the LedgerId and EntryId.

Thanks,
Penghui

On Sat, Nov 12, 2022 at 6:53 AM Michael Marshall <mm...@apache.org>
wrote:

> Great discussion. I generally agree with the conclusions.
>
> I'll add two points.
>
> Several endpoints in the topics admin api require that message ids are
> consistently serialized and deserialized over HTTP. For example, the
> reset cursor call requires a message id in a correct format or
> "latest" or "earliest". I don't think this is a break in the
> abstraction, but it does imply that the serialization definitions
> (e.g. `toString()`) in each client language need to be consistent.
>
> Second, the admin api exposes the ledgerId/entryId implementation with
> the get message id endpoint, which asks users to specify a ledger id
> and an entry id instead of a serialized message id:
>
>
> https://github.com/apache/pulsar/blob/681d51d54937becc5c436cece3b731150afdf6d1/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/PersistentTopics.java#L1930-L1944
>
> The Admin Client exposes this endpoint here:
>
>
> https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1658-L1682
>
> Based on the conclusion of this thread, it seems like we should
> deprecate the affected methods and replace them with ones that take a
> message id instead.
>
> Thanks,
> Michael
>
>
>
> On Wed, Nov 9, 2022 at 8:34 PM Yunze Xu <yz...@streamnative.io.invalid>
> wrote:
> >
> > I also changed my mind after I saw Flink's MesssageIdUtils
> implementation.
> >
> > Now it's clear to me that:
> > - For application users, the APIs in the pulsar-client-api module are
> > what they should use.
> > - For Pulsar ecosystem developers, the APIs in the pulsar-client
> > module are interfaces
> >
> > So at the moment, these MessageId implementations could already be used
> in some
> > external applications.
> >
> > In conclusion,
> > 1. The MessageId interface should not be touched
> > 2. The public methods of the MessageId implementations should not be
> touched
> >
> > Based on these two key points, I'm going to look into these
> > implementations and mark
> > some as deprecated but they should still work like before.
> >
> > Thanks,
> > Yunze
> >
> > On Thu, Nov 10, 2022 at 3:50 AM Rajan Dhabalia <rd...@apache.org>
> wrote:
> > >
> > > Hi,
> > >
> > > I was reading the email thread why we want to change MessageId
> interface:
> > > https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd
> > > >> Currently we have the following 5 implementations of MessageId:
> > > >> These implementations are such a mess. For example, when users get a
> > > MessageId from `Producer#send`:
> > >
> > > I think above discussion started by stating that MessageId has multiple
> > > implementations eg: MessageIdImpl, BatchMessageIdImpl,
> ChunkMessageIdImpl,
> > > etc.. and as a client we receive a single MessageId with send message
> API.
> > >
> > > Well, I see it's expected and very well defined behavior. No matter
> what
> > > implementation client library is internally using, as a user for my
> topic I
> > > should be able to publish and consume messages by providing specific
> > > configurations. The moment a user has to know implementation details
> of a
> > > client library such as chunk or batch message internals, there will be
> a
> > > strong dependency created between application and server implementation
> > > which is a RED flag for any system. Abstraction helps user adoption by
> > > simplifying user API and allowing systems to enhance without worrying
> about
> > > application dependency.
> > > Removing abstraction of MessageId and extracting implementation
> information
> > > in getter API will prevent us to make any implementation change (eg:
> won't
> > > be able to change chunk mesasgeId behavior which we might need in
> future
> > > for Shared-subscription), introduce hacks for any enhancement by making
> > > sure existing internal implementation can't change (eg: if we can't
> change
> > > chunk/batch-message-Id then we might create a new wrapper and attach to
> > > existing messageId which will make even things worse), and even
> prevent us
> > > to change backend implementation (eg: we will be stuck with BK-Ledger
> > > implementation forever).
> > > Pulsar is used for many large scale business usecaeses with a large
> number
> > > of users and it can create a nightmare for operators and users if we
> remove
> > > abstractions and start following the practice of exposing system
> internals
> > > to user applications.
> > >
> > > Thanks,
> > > Rajan
> > >
> > >
> > > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> > > wrote:
> > >
> > > > Hi Joe,
> > > >
> > > > Then what would we expect users to do with the MessageId? It should
> only
> > > > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> > > >
> > > > What about the partition index? We have a `TopicMetadata` interface
> that
> > > > returns
> > > > the number of partitions. If the partition is also "implementation
> > > > details", should we expose
> > > > this interface? Or should we support customizing a MessageRouter
> because it
> > > > returns the partition index?
> > > >
> > > > What about the batch index and batch size? For example, we have an
> > > > enableBatchIndexAcknowledgment method to enable batch index ACK. If
> batch
> > > > index is also "implementation details", how could users know what
> does
> > > > "batch
> > > > index ack" mean?
> > > >
> > > > Even for ledger id and entry id, this pair represents a logic storage
> > > > position like the offset
> > > > concept in Kafka (though each offset represents a message while each
> > > > entry represents
> > > > a batch). If you see the Message API, it also exposes many
> attributes.
> > > > IMO, for the
> > > > MessageIdData, only the ack_set (a long array serialized from the
> > > > BitSet) is the implementation
> > > > detail.
> > > >
> > > > The MessageId API should be flexible, not an abstract one. If not,
> why
> > > > do we still implement
> > > > the toString() method? We should not encourage users to print the
> > > > MessageId. It would
> > > > be easy to know what "ledger is 0, entry id is 1" means, users only
> > > > need to know the concepts
> > > > of ledger id and entry id. But it would be harder to know a tuple
> like
> > > > "0:1:-1:-1" means.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > > > >
> > > > > >Maybe this design is to hidden some details, but if
> > > > > users don't know the details like ledger id and entry id, how could
> > > > > you know what does "0:0:-1:0" mean?
> > > > >
> > > > >  Abstractions exist for a reason. Ledgerid and entryid are
> > > > implementation
> > > > > details, and an application should not be interpreting that at all.
> > > > > -j
> > > > >
> > > > >
> > > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu
> <yz...@streamnative.io.invalid>
> > > > > wrote:
> > > > >
> > > > > > I didn't look into these two methods at the moment. But I think
> it's
> > > > > > possible to
> > > > > > retain only the `fromByteArray`.
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <
> eolivelli@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > >
> > > > > > > > Hi Enrico,
> > > > > > > >
> > > > > > > > > We also need a way to represent this as a String or a
> byte[]
> > > > > > > >
> > > > > > > > We already have the `toByteArray` method, right?
> > > > > > >
> > > > > > > Yes, correct. So we are fine. I forgot about it and I answered
> too
> > > > > > quickly.
> > > > > > >
> > > > > > > I am not sure if this can be in the scope of this initiative,
> but we
> > > > > > > should somehow get rid of
> > > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > > > >
> > > > > > > Thanks
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > > > > > >
> > > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > > > eolivelli@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Currently we have the following 5 implementations of
> MessageId:
> > > > > > > > > >
> > > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size,
> acker),
> > > > > > where
> > > > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > > > represents
> > > > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic
> name to
> > > > the
> > > > > > > > > >     MessageId.
> > > > > > > > > > - TopicMessageIdImpl: adds the topic name and the
> partition
> > > > name
> > > > > > > > > >
> > > > > > > > > > These implementations are such a mess. For example, when
> users
> > > > get
> > > > > > a
> > > > > > > > > > MessageId from `Producer#send`:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > var id = producer.send("msg");
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > There is no getter to get some specific fields like
> ledger id.
> > > > You
> > > > > > can
> > > > > > > > > > only see a representation from `toString` method and got
> some
> > > > > > output
> > > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some
> details,
> > > > but
> > > > > > if
> > > > > > > > > > users don't know the details like ledger id and entry
> id, how
> > > > could
> > > > > > > > > > you know what does "0:0:-1:0" mean? What if
> > > > `MessageId#toString`'s
> > > > > > > > > > implementation changed? Should it be treated as a
> breaking
> > > > change?
> > > > > > > > > >
> > > > > > > > > > The original definition of the underlying MessageIdData
> is much
> > > > > > more
> > > > > > > > > > clear:
> > > > > > > > > >
> > > > > > > > > > ```proto
> > > > > > > > > > message MessageIdData {
> > > > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > > > >     required uint64 entryId  = 2;
> > > > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > > > >     optional int32 batch_size = 6;
> > > > > > > > > >
> > > > > > > > > >     // For the chunk message id, we need to specify the
> first
> > > > > > chunk message id.
> > > > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > > > }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData.
> It's more
> > > > > > natural
> > > > > > > > > > to have an interface like:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > interface MessageId {
> > > > > > > > > >     long ledgerId();
> > > > > > > > > >     long entryId();
> > > > > > > > > >     Optional<Integer> partition();
> > > > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > > > >     // ...
> > > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > This is very good for client applications.
> > > > > > > > > We also need a way to represent this as a String or a
> byte[],
> > > > this
> > > > > > way
> > > > > > > > > client applications have a standard way to store
> > > > > > > > > message offsets into an external system (for instance when
> you
> > > > want
> > > > > > to
> > > > > > > > > user the Reader API and keep track of the position by
> yourself)
> > > > > > > > >
> > > > > > > > > Enrico
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Additionally, there are many places that use only the
> triple of
> > > > > > > > > > (ledger id, entry id, batch index) as the key to
> represent the
> > > > > > position.
> > > > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > > > intuitive
> > > > > > to
> > > > > > > > > > write something like:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > class MessageIdPosition implements
> > > > Comparable<MessageIdPosition> {
> > > > > > > > > >     private final MessageId messageId;
> > > > > > > > > >     // TODO: compare only the triple (ledger, entry,
> batch)
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > > > MessageId
> > > > > > > > > > interface only by adding some getters. Regarding the 5
> existing
> > > > > > > > > > implementations, I think we can drop them because they
> are a
> > > > part
> > > > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > > > >
> > > > > > > > > > Please feel free to share your points.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Yunze
> > > > > >
> > > >
>

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Michael Marshall <mm...@apache.org>.
Great discussion. I generally agree with the conclusions.

I'll add two points.

Several endpoints in the topics admin api require that message ids are
consistently serialized and deserialized over HTTP. For example, the
reset cursor call requires a message id in a correct format or
"latest" or "earliest". I don't think this is a break in the
abstraction, but it does imply that the serialization definitions
(e.g. `toString()`) in each client language need to be consistent.

Second, the admin api exposes the ledgerId/entryId implementation with
the get message id endpoint, which asks users to specify a ledger id
and an entry id instead of a serialized message id:

https://github.com/apache/pulsar/blob/681d51d54937becc5c436cece3b731150afdf6d1/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/PersistentTopics.java#L1930-L1944

The Admin Client exposes this endpoint here:

https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/pulsar-client-admin-api/src/main/java/org/apache/pulsar/client/admin/Topics.java#L1658-L1682

Based on the conclusion of this thread, it seems like we should
deprecate the affected methods and replace them with ones that take a
message id instead.

Thanks,
Michael



On Wed, Nov 9, 2022 at 8:34 PM Yunze Xu <yz...@streamnative.io.invalid> wrote:
>
> I also changed my mind after I saw Flink's MesssageIdUtils implementation.
>
> Now it's clear to me that:
> - For application users, the APIs in the pulsar-client-api module are
> what they should use.
> - For Pulsar ecosystem developers, the APIs in the pulsar-client
> module are interfaces
>
> So at the moment, these MessageId implementations could already be used in some
> external applications.
>
> In conclusion,
> 1. The MessageId interface should not be touched
> 2. The public methods of the MessageId implementations should not be touched
>
> Based on these two key points, I'm going to look into these
> implementations and mark
> some as deprecated but they should still work like before.
>
> Thanks,
> Yunze
>
> On Thu, Nov 10, 2022 at 3:50 AM Rajan Dhabalia <rd...@apache.org> wrote:
> >
> > Hi,
> >
> > I was reading the email thread why we want to change MessageId interface:
> > https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd
> > >> Currently we have the following 5 implementations of MessageId:
> > >> These implementations are such a mess. For example, when users get a
> > MessageId from `Producer#send`:
> >
> > I think above discussion started by stating that MessageId has multiple
> > implementations eg: MessageIdImpl, BatchMessageIdImpl, ChunkMessageIdImpl,
> > etc.. and as a client we receive a single MessageId with send message API.
> >
> > Well, I see it's expected and very well defined behavior. No matter what
> > implementation client library is internally using, as a user for my topic I
> > should be able to publish and consume messages by providing specific
> > configurations. The moment a user has to know implementation details of a
> > client library such as chunk or batch message internals, there will be a
> > strong dependency created between application and server implementation
> > which is a RED flag for any system. Abstraction helps user adoption by
> > simplifying user API and allowing systems to enhance without worrying about
> > application dependency.
> > Removing abstraction of MessageId and extracting implementation information
> > in getter API will prevent us to make any implementation change (eg: won't
> > be able to change chunk mesasgeId behavior which we might need in future
> > for Shared-subscription), introduce hacks for any enhancement by making
> > sure existing internal implementation can't change (eg: if we can't change
> > chunk/batch-message-Id then we might create a new wrapper and attach to
> > existing messageId which will make even things worse), and even prevent us
> > to change backend implementation (eg: we will be stuck with BK-Ledger
> > implementation forever).
> > Pulsar is used for many large scale business usecaeses with a large number
> > of users and it can create a nightmare for operators and users if we remove
> > abstractions and start following the practice of exposing system internals
> > to user applications.
> >
> > Thanks,
> > Rajan
> >
> >
> > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> > wrote:
> >
> > > Hi Joe,
> > >
> > > Then what would we expect users to do with the MessageId? It should only
> > > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> > >
> > > What about the partition index? We have a `TopicMetadata` interface that
> > > returns
> > > the number of partitions. If the partition is also "implementation
> > > details", should we expose
> > > this interface? Or should we support customizing a MessageRouter because it
> > > returns the partition index?
> > >
> > > What about the batch index and batch size? For example, we have an
> > > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > > index is also "implementation details", how could users know what does
> > > "batch
> > > index ack" mean?
> > >
> > > Even for ledger id and entry id, this pair represents a logic storage
> > > position like the offset
> > > concept in Kafka (though each offset represents a message while each
> > > entry represents
> > > a batch). If you see the Message API, it also exposes many attributes.
> > > IMO, for the
> > > MessageIdData, only the ack_set (a long array serialized from the
> > > BitSet) is the implementation
> > > detail.
> > >
> > > The MessageId API should be flexible, not an abstract one. If not, why
> > > do we still implement
> > > the toString() method? We should not encourage users to print the
> > > MessageId. It would
> > > be easy to know what "ledger is 0, entry id is 1" means, users only
> > > need to know the concepts
> > > of ledger id and entry id. But it would be harder to know a tuple like
> > > "0:1:-1:-1" means.
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > > >
> > > > >Maybe this design is to hidden some details, but if
> > > > users don't know the details like ledger id and entry id, how could
> > > > you know what does "0:0:-1:0" mean?
> > > >
> > > >  Abstractions exist for a reason. Ledgerid and entryid are
> > > implementation
> > > > details, and an application should not be interpreting that at all.
> > > > -j
> > > >
> > > >
> > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > > > wrote:
> > > >
> > > > > I didn't look into these two methods at the moment. But I think it's
> > > > > possible to
> > > > > retain only the `fromByteArray`.
> > > > >
> > > > > Thanks,
> > > > > Yunze
> > > > >
> > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > >
> > > > > > > Hi Enrico,
> > > > > > >
> > > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > > >
> > > > > > > We already have the `toByteArray` method, right?
> > > > > >
> > > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > > quickly.
> > > > > >
> > > > > > I am not sure if this can be in the scope of this initiative, but we
> > > > > > should somehow get rid of
> > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > > >
> > > > > > Thanks
> > > > > > Enrico
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yunze
> > > > > > >
> > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > > eolivelli@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > > > > >
> > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > > > > where
> > > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > > represents
> > > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to
> > > the
> > > > > > > > >     MessageId.
> > > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > > name
> > > > > > > > >
> > > > > > > > > These implementations are such a mess. For example, when users
> > > get
> > > > > a
> > > > > > > > > MessageId from `Producer#send`:
> > > > > > > > >
> > > > > > > > > ```java
> > > > > > > > > var id = producer.send("msg");
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > There is no getter to get some specific fields like ledger id.
> > > You
> > > > > can
> > > > > > > > > only see a representation from `toString` method and got some
> > > > > output
> > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > > but
> > > > > if
> > > > > > > > > users don't know the details like ledger id and entry id, how
> > > could
> > > > > > > > > you know what does "0:0:-1:0" mean? What if
> > > `MessageId#toString`'s
> > > > > > > > > implementation changed? Should it be treated as a breaking
> > > change?
> > > > > > > > >
> > > > > > > > > The original definition of the underlying MessageIdData is much
> > > > > more
> > > > > > > > > clear:
> > > > > > > > >
> > > > > > > > > ```proto
> > > > > > > > > message MessageIdData {
> > > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > > >     required uint64 entryId  = 2;
> > > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > > >     optional int32 batch_size = 6;
> > > > > > > > >
> > > > > > > > >     // For the chunk message id, we need to specify the first
> > > > > chunk message id.
> > > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > > }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > > > > natural
> > > > > > > > > to have an interface like:
> > > > > > > > >
> > > > > > > > > ```java
> > > > > > > > > interface MessageId {
> > > > > > > > >     long ledgerId();
> > > > > > > > >     long entryId();
> > > > > > > > >     Optional<Integer> partition();
> > > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > > >     // ...
> > > > > > > > > ```
> > > > > > > >
> > > > > > > > This is very good for client applications.
> > > > > > > > We also need a way to represent this as a String or a byte[],
> > > this
> > > > > way
> > > > > > > > client applications have a standard way to store
> > > > > > > > message offsets into an external system (for instance when you
> > > want
> > > > > to
> > > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > > >
> > > > > > > > Enrico
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Additionally, there are many places that use only the triple of
> > > > > > > > > (ledger id, entry id, batch index) as the key to represent the
> > > > > position.
> > > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > > intuitive
> > > > > to
> > > > > > > > > write something like:
> > > > > > > > >
> > > > > > > > > ```java
> > > > > > > > > class MessageIdPosition implements
> > > Comparable<MessageIdPosition> {
> > > > > > > > >     private final MessageId messageId;
> > > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > > MessageId
> > > > > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > > > > implementations, I think we can drop them because they are a
> > > part
> > > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > > >
> > > > > > > > > Please feel free to share your points.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Yunze
> > > > >
> > >

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
I also changed my mind after I saw Flink's MesssageIdUtils implementation.

Now it's clear to me that:
- For application users, the APIs in the pulsar-client-api module are
what they should use.
- For Pulsar ecosystem developers, the APIs in the pulsar-client
module are interfaces

So at the moment, these MessageId implementations could already be used in some
external applications.

In conclusion,
1. The MessageId interface should not be touched
2. The public methods of the MessageId implementations should not be touched

Based on these two key points, I'm going to look into these
implementations and mark
some as deprecated but they should still work like before.

Thanks,
Yunze

On Thu, Nov 10, 2022 at 3:50 AM Rajan Dhabalia <rd...@apache.org> wrote:
>
> Hi,
>
> I was reading the email thread why we want to change MessageId interface:
> https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd
> >> Currently we have the following 5 implementations of MessageId:
> >> These implementations are such a mess. For example, when users get a
> MessageId from `Producer#send`:
>
> I think above discussion started by stating that MessageId has multiple
> implementations eg: MessageIdImpl, BatchMessageIdImpl, ChunkMessageIdImpl,
> etc.. and as a client we receive a single MessageId with send message API.
>
> Well, I see it's expected and very well defined behavior. No matter what
> implementation client library is internally using, as a user for my topic I
> should be able to publish and consume messages by providing specific
> configurations. The moment a user has to know implementation details of a
> client library such as chunk or batch message internals, there will be a
> strong dependency created between application and server implementation
> which is a RED flag for any system. Abstraction helps user adoption by
> simplifying user API and allowing systems to enhance without worrying about
> application dependency.
> Removing abstraction of MessageId and extracting implementation information
> in getter API will prevent us to make any implementation change (eg: won't
> be able to change chunk mesasgeId behavior which we might need in future
> for Shared-subscription), introduce hacks for any enhancement by making
> sure existing internal implementation can't change (eg: if we can't change
> chunk/batch-message-Id then we might create a new wrapper and attach to
> existing messageId which will make even things worse), and even prevent us
> to change backend implementation (eg: we will be stuck with BK-Ledger
> implementation forever).
> Pulsar is used for many large scale business usecaeses with a large number
> of users and it can create a nightmare for operators and users if we remove
> abstractions and start following the practice of exposing system internals
> to user applications.
>
> Thanks,
> Rajan
>
>
> On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> wrote:
>
> > Hi Joe,
> >
> > Then what would we expect users to do with the MessageId? It should only
> > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> >
> > What about the partition index? We have a `TopicMetadata` interface that
> > returns
> > the number of partitions. If the partition is also "implementation
> > details", should we expose
> > this interface? Or should we support customizing a MessageRouter because it
> > returns the partition index?
> >
> > What about the batch index and batch size? For example, we have an
> > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > index is also "implementation details", how could users know what does
> > "batch
> > index ack" mean?
> >
> > Even for ledger id and entry id, this pair represents a logic storage
> > position like the offset
> > concept in Kafka (though each offset represents a message while each
> > entry represents
> > a batch). If you see the Message API, it also exposes many attributes.
> > IMO, for the
> > MessageIdData, only the ack_set (a long array serialized from the
> > BitSet) is the implementation
> > detail.
> >
> > The MessageId API should be flexible, not an abstract one. If not, why
> > do we still implement
> > the toString() method? We should not encourage users to print the
> > MessageId. It would
> > be easy to know what "ledger is 0, entry id is 1" means, users only
> > need to know the concepts
> > of ledger id and entry id. But it would be harder to know a tuple like
> > "0:1:-1:-1" means.
> >
> > Thanks,
> > Yunze
> >
> > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > >
> > > >Maybe this design is to hidden some details, but if
> > > users don't know the details like ledger id and entry id, how could
> > > you know what does "0:0:-1:0" mean?
> > >
> > >  Abstractions exist for a reason. Ledgerid and entryid are
> > implementation
> > > details, and an application should not be interpreting that at all.
> > > -j
> > >
> > >
> > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > > wrote:
> > >
> > > > I didn't look into these two methods at the moment. But I think it's
> > > > possible to
> > > > retain only the `fromByteArray`.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > >
> > > > > > Hi Enrico,
> > > > > >
> > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > >
> > > > > > We already have the `toByteArray` method, right?
> > > > >
> > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > quickly.
> > > > >
> > > > > I am not sure if this can be in the scope of this initiative, but we
> > > > > should somehow get rid of
> > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > >
> > > > > Thanks
> > > > > Enrico
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > eolivelli@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > > > >
> > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > > > where
> > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > represents
> > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to
> > the
> > > > > > > >     MessageId.
> > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > name
> > > > > > > >
> > > > > > > > These implementations are such a mess. For example, when users
> > get
> > > > a
> > > > > > > > MessageId from `Producer#send`:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > var id = producer.send("msg");
> > > > > > > > ```
> > > > > > > >
> > > > > > > > There is no getter to get some specific fields like ledger id.
> > You
> > > > can
> > > > > > > > only see a representation from `toString` method and got some
> > > > output
> > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > but
> > > > if
> > > > > > > > users don't know the details like ledger id and entry id, how
> > could
> > > > > > > > you know what does "0:0:-1:0" mean? What if
> > `MessageId#toString`'s
> > > > > > > > implementation changed? Should it be treated as a breaking
> > change?
> > > > > > > >
> > > > > > > > The original definition of the underlying MessageIdData is much
> > > > more
> > > > > > > > clear:
> > > > > > > >
> > > > > > > > ```proto
> > > > > > > > message MessageIdData {
> > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > >     required uint64 entryId  = 2;
> > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > >     optional int32 batch_size = 6;
> > > > > > > >
> > > > > > > >     // For the chunk message id, we need to specify the first
> > > > chunk message id.
> > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > > > natural
> > > > > > > > to have an interface like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > interface MessageId {
> > > > > > > >     long ledgerId();
> > > > > > > >     long entryId();
> > > > > > > >     Optional<Integer> partition();
> > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > >     // ...
> > > > > > > > ```
> > > > > > >
> > > > > > > This is very good for client applications.
> > > > > > > We also need a way to represent this as a String or a byte[],
> > this
> > > > way
> > > > > > > client applications have a standard way to store
> > > > > > > message offsets into an external system (for instance when you
> > want
> > > > to
> > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > >
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Additionally, there are many places that use only the triple of
> > > > > > > > (ledger id, entry id, batch index) as the key to represent the
> > > > position.
> > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > intuitive
> > > > to
> > > > > > > > write something like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > class MessageIdPosition implements
> > Comparable<MessageIdPosition> {
> > > > > > > >     private final MessageId messageId;
> > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > ```
> > > > > > > >
> > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > MessageId
> > > > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > > > implementations, I think we can drop them because they are a
> > part
> > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > >
> > > > > > > > Please feel free to share your points.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > >
> >

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Rajan Dhabalia <rd...@apache.org>.
Hi,

I was reading the email thread why we want to change MessageId interface:
https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd
>> Currently we have the following 5 implementations of MessageId:
>> These implementations are such a mess. For example, when users get a
MessageId from `Producer#send`:

I think above discussion started by stating that MessageId has multiple
implementations eg: MessageIdImpl, BatchMessageIdImpl, ChunkMessageIdImpl,
etc.. and as a client we receive a single MessageId with send message API.

Well, I see it's expected and very well defined behavior. No matter what
implementation client library is internally using, as a user for my topic I
should be able to publish and consume messages by providing specific
configurations. The moment a user has to know implementation details of a
client library such as chunk or batch message internals, there will be a
strong dependency created between application and server implementation
which is a RED flag for any system. Abstraction helps user adoption by
simplifying user API and allowing systems to enhance without worrying about
application dependency.
Removing abstraction of MessageId and extracting implementation information
in getter API will prevent us to make any implementation change (eg: won't
be able to change chunk mesasgeId behavior which we might need in future
for Shared-subscription), introduce hacks for any enhancement by making
sure existing internal implementation can't change (eg: if we can't change
chunk/batch-message-Id then we might create a new wrapper and attach to
existing messageId which will make even things worse), and even prevent us
to change backend implementation (eg: we will be stuck with BK-Ledger
implementation forever).
Pulsar is used for many large scale business usecaeses with a large number
of users and it can create a nightmare for operators and users if we remove
abstractions and start following the practice of exposing system internals
to user applications.

Thanks,
Rajan


On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
wrote:

> Hi Joe,
>
> Then what would we expect users to do with the MessageId? It should only
> be passed to Consumer#seek or ReaderBuilder#startMessageId?
>
> What about the partition index? We have a `TopicMetadata` interface that
> returns
> the number of partitions. If the partition is also "implementation
> details", should we expose
> this interface? Or should we support customizing a MessageRouter because it
> returns the partition index?
>
> What about the batch index and batch size? For example, we have an
> enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> index is also "implementation details", how could users know what does
> "batch
> index ack" mean?
>
> Even for ledger id and entry id, this pair represents a logic storage
> position like the offset
> concept in Kafka (though each offset represents a message while each
> entry represents
> a batch). If you see the Message API, it also exposes many attributes.
> IMO, for the
> MessageIdData, only the ack_set (a long array serialized from the
> BitSet) is the implementation
> detail.
>
> The MessageId API should be flexible, not an abstract one. If not, why
> do we still implement
> the toString() method? We should not encourage users to print the
> MessageId. It would
> be easy to know what "ledger is 0, entry id is 1" means, users only
> need to know the concepts
> of ledger id and entry id. But it would be harder to know a tuple like
> "0:1:-1:-1" means.
>
> Thanks,
> Yunze
>
> On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> >
> > >Maybe this design is to hidden some details, but if
> > users don't know the details like ledger id and entry id, how could
> > you know what does "0:0:-1:0" mean?
> >
> >  Abstractions exist for a reason. Ledgerid and entryid are
> implementation
> > details, and an application should not be interpreting that at all.
> > -j
> >
> >
> > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > wrote:
> >
> > > I didn't look into these two methods at the moment. But I think it's
> > > possible to
> > > retain only the `fromByteArray`.
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > > >
> > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > >
> > > > > Hi Enrico,
> > > > >
> > > > > > We also need a way to represent this as a String or a byte[]
> > > > >
> > > > > We already have the `toByteArray` method, right?
> > > >
> > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > quickly.
> > > >
> > > > I am not sure if this can be in the scope of this initiative, but we
> > > > should somehow get rid of
> > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > >
> > > > Thanks
> > > > Enrico
> > > >
> > > > >
> > > > > Thanks,
> > > > > Yunze
> > > > >
> > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> eolivelli@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > > >
> > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > > where
> > > > > > >     acker is a wrapper of a BitSet.
> > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> represents
> > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to
> the
> > > > > > >     MessageId.
> > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> name
> > > > > > >
> > > > > > > These implementations are such a mess. For example, when users
> get
> > > a
> > > > > > > MessageId from `Producer#send`:
> > > > > > >
> > > > > > > ```java
> > > > > > > var id = producer.send("msg");
> > > > > > > ```
> > > > > > >
> > > > > > > There is no getter to get some specific fields like ledger id.
> You
> > > can
> > > > > > > only see a representation from `toString` method and got some
> > > output
> > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> but
> > > if
> > > > > > > users don't know the details like ledger id and entry id, how
> could
> > > > > > > you know what does "0:0:-1:0" mean? What if
> `MessageId#toString`'s
> > > > > > > implementation changed? Should it be treated as a breaking
> change?
> > > > > > >
> > > > > > > The original definition of the underlying MessageIdData is much
> > > more
> > > > > > > clear:
> > > > > > >
> > > > > > > ```proto
> > > > > > > message MessageIdData {
> > > > > > >     required uint64 ledgerId = 1;
> > > > > > >     required uint64 entryId  = 2;
> > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > >     repeated int64 ack_set = 5;
> > > > > > >     optional int32 batch_size = 6;
> > > > > > >
> > > > > > >     // For the chunk message id, we need to specify the first
> > > chunk message id.
> > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > }
> > > > > > > ```
> > > > > > >
> > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > > natural
> > > > > > > to have an interface like:
> > > > > > >
> > > > > > > ```java
> > > > > > > interface MessageId {
> > > > > > >     long ledgerId();
> > > > > > >     long entryId();
> > > > > > >     Optional<Integer> partition();
> > > > > > >     Optional<Integer> batchIndex();
> > > > > > >     // ...
> > > > > > > ```
> > > > > >
> > > > > > This is very good for client applications.
> > > > > > We also need a way to represent this as a String or a byte[],
> this
> > > way
> > > > > > client applications have a standard way to store
> > > > > > message offsets into an external system (for instance when you
> want
> > > to
> > > > > > user the Reader API and keep track of the position by yourself)
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > >
> > > > > > > Additionally, there are many places that use only the triple of
> > > > > > > (ledger id, entry id, batch index) as the key to represent the
> > > position.
> > > > > > > Currently, they are done by adding a conversion from
> > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> intuitive
> > > to
> > > > > > > write something like:
> > > > > > >
> > > > > > > ```java
> > > > > > > class MessageIdPosition implements
> Comparable<MessageIdPosition> {
> > > > > > >     private final MessageId messageId;
> > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > ```
> > > > > > >
> > > > > > > Therefore, I'm going to write a proposal to redesign the
> MessageId
> > > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > > implementations, I think we can drop them because they are a
> part
> > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > >
> > > > > > > Please feel free to share your points.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yunze
> > >
>

Re: [DISCUSSION] Redesign the MessageId interface

Posted by tison <wa...@gmail.com>.
FWIW, the Flink Pulsar connector hacky parses the message id internals to
get the next message id:
https://github.com/apache/flink/blob/421f057a7488fd64854a82424755f76b89561a0b/flink-connectors/flink-connector-pulsar/src/main/java/org/apache/flink/connector/pulsar/source/enumerator/cursor/MessageIdUtils.java

Enrico Olivelli <eo...@gmail.com>于2022年11月10日 周四01:03写道:

> After reading Joe's comments I have changed my mind.
> Actually it is better to not expose "ledgerId" and "entryId" to client
> applications.
> They are useless pieces of information.
> And also if in the future we want to change the way we internally
> address a message we will always have to support these fields.
>
> toByteArray() is enough for an application to save the ID into an
> external database and then to recover a Subscription (or a Reader)
> from a certain point.
> toString() is good only for debug/logs, we can change it but it is
> better to not touch it and add "tests"
>
> It is better that the MessageId API stays as opaque as possible.
>
> Enrico
>
>
> Il giorno mer 9 nov 2022 alle ore 15:50 Yunze Xu
> <yz...@streamnative.io.invalid> ha scritto:
> >
> > Hi Jiaqi,
> >
> > > I don't think `tostring` should be used in any serious case because it
> has
> > no standard.
> >
> > I agree. But it's better to keep it not changed. Just like my previous
> reply, it
> > might be a de-facto standard because the `toString()` like methods are
> used
> > in logging, not only for debugging. For example, there is a
> getLastMessageId
> > API in consumer and users might log the last message ID.
> >
> > Different representations across different languages is not a big issue
> but it
> > could make users and administrators confused.
> >
> > Here is an example that the C++ client supports constructing a MessageId
> with
> > 4 arguments, but the 1st argument is the partition, not the ledger id.
> > However, the
> > string representation is still
> > "<ledger-id>:<entry-id>:<partition>:<batch-index>". Though
> > in Java client a non-batched message ID doesn't have the
> > ":<batch-index>" suffix.
> >
> > Thanks,
> > Yunze
> >
> > On Wed, Nov 9, 2022 at 9:13 PM Jiaqi Shen <gl...@gmail.com>
> wrote:
> > >
> > > Thanks, this is very inspiring to me.
> > >
> > > But I have a different opinion on `tostring`.
> > >
> > > >>You can only see a representation from `toString` method and got some
> > > output like "0:0:-1:0".
> > >
> > > I don't think `tostring` should be used in any serious case because it
> has
> > > no standard. There is no constraint on how the messageId should be
> > > converted to a string. For example, in go client, `tostring` is not
> being
> > > supported now. If go client should implement a `tostring` method, does
> go
> > > client' s`tostring` must follow the java implement like "0:0:-1:0"?
> > >
> > > If user do need a string/[]byte to record a messageId, `toByteArray`
> will
> > > be enough. In user side, most of the time , I think users don't really
> care
> > > about the "messageId string" is meaningful. I think `tostring` only
> should
> > > be used in debug.
> > >
> > > Thanks,
> > > Jiaqi Shen
> > >
> > >
> > > Joe F <jo...@gmail.com> 于2022年11月9日周三 20:25写道:
> > >
> > > > Messageid is an identifier which identifies a message.  How that id
> is
> > > > constructed, or what it contains should not  matter to an
> application,  and
> > > > an application should not assume anything about the implementation
> of that
> > > > id.
> > > >
> > > > >What about the partition index? We have a `TopicMetadata` interface
> that
> > > > returns the number of partitions.
> > > >
> > > > Partitioning is a first class concept, and is  designed to be used by
> > > > application.  How a partition is implemented  should not be used by
> the
> > > > application .
> > > >
> > > >  [ People violate this all the time, and I regret that Pulsar did not
> > > > provide get_Nth_topicpartion(), which led to people hardcoding it  as
> > > > topicname-N. and using that directly.  Now we are stuck with it.]
> > > >
> > > >  Similarly batch index and batch size. Those are all logical concepts
> > > > exposed to the user.  For eg: batch size is something the app is
> allowed to
> > > > tune
> > > >
> > > > >Even for ledger id and entry id, this pair represents a logic
> storage
> > > > position like the offset concept in Kafka
> > > > These are not equivalent.   In Pulsar these are implementation
> details,
> > > > while in Kafka those are logical concepts.
> > > >
> > > > One might think that these are logical concepts in Pulsar, because
> if you
> > > > reverse engineer the current msgid implementation, you observe some
> > > > "properties".
> > > >
> > > > Ledger id/entry id are logical concepts in __Bookkeeper__ , not  in
> Pulsar.
> > > > There is the Managed Ledger abstraction on top of BK, and then there
> is
> > > > Pulsar on top of ML. You will break two levels of abstraction to
> expose
> > > > ledger/entryid to an application
> > > >
> > > > An application  should only care about the  operations that  can be
> done
> > > > with a messageId
> > > >
> > > > - getmsgid() to return the message id  as an opaque object
> > > >
> > > > [Operators   using  one messageId ]
> > > > -serde,   like tostring(). for storage/retrieval of message
> identifier
> > > > -getter/setter on logical properties of the message (partition id
> etc...)
> > > > -increment/decrement
> > > >
> > > > [Operators that take multiple messageIds]
> > > > -comparator
> > > > -range
> > > >
> > > > Those are the kind of operators Pulsar should provide to a user.
> > > > Applications should not implement these operators on their own by
> reverse
> > > > engineering the msgId. No application should be directly using
> ledgerid or
> > > > entryid for doing anything (math or logic),
> > > >
> > > >   As long as Pulsar provides  these operations  with msgid to the
> > > > application,  it should not care whether it's represented as
> "0:1:-1:-1"
> > > > or  "a:b:-b-b", or   "#xba4231!haxcy1826923f" or as a serialized
> binary
> > > > object or..whatever it may be.
> > > >
> > > > >>But it would be harder to know a tuple like "0:1:-1:-1" means.
> > > >
> > > > A user shouldn't have to know what this means. That's the point.
> > > >
> > > > Pulsar itself changed the messageId multiple times as it added
> > > > partitioning, batching and so on, and it might do so again. And
> bookkeeper
> > > > could change its representation of  ledgers, (for eg,  to uuids and
> byte
> > > > offsets)  ML could replace BK with something else  (for eg.  a table
> in a
> > > > db.)  Anything is possible - Pulsar would then just have to change
> the
> > > > implementation of the operator functions, and no application needs
> to be
> > > > rewritten.
> > > >
> > > > -j
> > > >
> > > > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yzxu@streamnative.io.invalid
> >
> > > > wrote:
> > > >
> > > > > Hi Joe,
> > > > >
> > > > > Then what would we expect users to do with the MessageId? It
> should only
> > > > > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> > > > >
> > > > > What about the partition index? We have a `TopicMetadata`
> interface that
> > > > > returns
> > > > > the number of partitions. If the partition is also "implementation
> > > > > details", should we expose
> > > > > this interface? Or should we support customizing a MessageRouter
> because
> > > > it
> > > > > returns the partition index?
> > > > >
> > > > > What about the batch index and batch size? For example, we have an
> > > > > enableBatchIndexAcknowledgment method to enable batch index ACK.
> If batch
> > > > > index is also "implementation details", how could users know what
> does
> > > > > "batch
> > > > > index ack" mean?
> > > > >
> > > > > Even for ledger id and entry id, this pair represents a logic
> storage
> > > > > position like the offset
> > > > > concept in Kafka (though each offset represents a message while
> each
> > > > > entry represents
> > > > > a batch). If you see the Message API, it also exposes many
> attributes.
> > > > > IMO, for the
> > > > > MessageIdData, only the ack_set (a long array serialized from the
> > > > > BitSet) is the implementation
> > > > > detail.
> > > > >
> > > > > The MessageId API should be flexible, not an abstract one. If not,
> why
> > > > > do we still implement
> > > > > the toString() method? We should not encourage users to print the
> > > > > MessageId. It would
> > > > > be easy to know what "ledger is 0, entry id is 1" means, users only
> > > > > need to know the concepts
> > > > > of ledger id and entry id. But it would be harder to know a tuple
> like
> > > > > "0:1:-1:-1" means.
> > > > >
> > > > > Thanks,
> > > > > Yunze
> > > > >
> > > > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com>
> wrote:
> > > > > >
> > > > > > >Maybe this design is to hidden some details, but if
> > > > > > users don't know the details like ledger id and entry id, how
> could
> > > > > > you know what does "0:0:-1:0" mean?
> > > > > >
> > > > > >  Abstractions exist for a reason. Ledgerid and entryid are
> > > > > implementation
> > > > > > details, and an application should not be interpreting that at
> all.
> > > > > > -j
> > > > > >
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu
> <yz...@streamnative.io.invalid>
> > > > > > wrote:
> > > > > >
> > > > > > > I didn't look into these two methods at the moment. But I
> think it's
> > > > > > > possible to
> > > > > > > retain only the `fromByteArray`.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yunze
> > > > > > >
> > > > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <
> eolivelli@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > > >
> > > > > > > > > Hi Enrico,
> > > > > > > > >
> > > > > > > > > > We also need a way to represent this as a String or a
> byte[]
> > > > > > > > >
> > > > > > > > > We already have the `toByteArray` method, right?
> > > > > > > >
> > > > > > > > Yes, correct. So we are fine. I forgot about it and I
> answered too
> > > > > > > quickly.
> > > > > > > >
> > > > > > > > I am not sure if this can be in the scope of this
> initiative, but
> > > > we
> > > > > > > > should somehow get rid of
> > > > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > > Enrico
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Yunze
> > > > > > > > >
> > > > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > > > > eolivelli@gmail.com>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > > > > >
> > > > > > > > > > > Hi all,
> > > > > > > > > > >
> > > > > > > > > > > Currently we have the following 5 implementations of
> > > > MessageId:
> > > > > > > > > > >
> > > > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size,
> > > > acker),
> > > > > > > where
> > > > > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > > > > represents
> > > > > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic
> name
> > > > to
> > > > > the
> > > > > > > > > > >     MessageId.
> > > > > > > > > > > - TopicMessageIdImpl: adds the topic name and the
> partition
> > > > > name
> > > > > > > > > > >
> > > > > > > > > > > These implementations are such a mess. For example,
> when
> > > > users
> > > > > get
> > > > > > > a
> > > > > > > > > > > MessageId from `Producer#send`:
> > > > > > > > > > >
> > > > > > > > > > > ```java
> > > > > > > > > > > var id = producer.send("msg");
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > There is no getter to get some specific fields like
> ledger
> > > > id.
> > > > > You
> > > > > > > can
> > > > > > > > > > > only see a representation from `toString` method and
> got some
> > > > > > > output
> > > > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some
> details,
> > > > > but
> > > > > > > if
> > > > > > > > > > > users don't know the details like ledger id and entry
> id, how
> > > > > could
> > > > > > > > > > > you know what does "0:0:-1:0" mean? What if
> > > > > `MessageId#toString`'s
> > > > > > > > > > > implementation changed? Should it be treated as a
> breaking
> > > > > change?
> > > > > > > > > > >
> > > > > > > > > > > The original definition of the underlying
> MessageIdData is
> > > > much
> > > > > > > more
> > > > > > > > > > > clear:
> > > > > > > > > > >
> > > > > > > > > > > ```proto
> > > > > > > > > > > message MessageIdData {
> > > > > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > > > > >     required uint64 entryId  = 2;
> > > > > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > > > > >     optional int32 batch_size = 6;
> > > > > > > > > > >
> > > > > > > > > > >     // For the chunk message id, we need to specify
> the first
> > > > > > > chunk message id.
> > > > > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > > > > }
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData.
> It's
> > > > more
> > > > > > > natural
> > > > > > > > > > > to have an interface like:
> > > > > > > > > > >
> > > > > > > > > > > ```java
> > > > > > > > > > > interface MessageId {
> > > > > > > > > > >     long ledgerId();
> > > > > > > > > > >     long entryId();
> > > > > > > > > > >     Optional<Integer> partition();
> > > > > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > > > > >     // ...
> > > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > This is very good for client applications.
> > > > > > > > > > We also need a way to represent this as a String or a
> byte[],
> > > > > this
> > > > > > > way
> > > > > > > > > > client applications have a standard way to store
> > > > > > > > > > message offsets into an external system (for instance
> when you
> > > > > want
> > > > > > > to
> > > > > > > > > > user the Reader API and keep track of the position by
> yourself)
> > > > > > > > > >
> > > > > > > > > > Enrico
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Additionally, there are many places that use only the
> triple
> > > > of
> > > > > > > > > > > (ledger id, entry id, batch index) as the key to
> represent
> > > > the
> > > > > > > position.
> > > > > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > > > > intuitive
> > > > > > > to
> > > > > > > > > > > write something like:
> > > > > > > > > > >
> > > > > > > > > > > ```java
> > > > > > > > > > > class MessageIdPosition implements
> > > > > Comparable<MessageIdPosition> {
> > > > > > > > > > >     private final MessageId messageId;
> > > > > > > > > > >     // TODO: compare only the triple (ledger, entry,
> batch)
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > Therefore, I'm going to write a proposal to redesign
> the
> > > > > MessageId
> > > > > > > > > > > interface only by adding some getters. Regarding the 5
> > > > existing
> > > > > > > > > > > implementations, I think we can drop them because they
> are a
> > > > > part
> > > > > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > > > > >
> > > > > > > > > > > Please feel free to share your points.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Yunze
> > > > > > >
> > > > >
> > > >
>
-- 
Best,
tison.

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Enrico Olivelli <eo...@gmail.com>.
After reading Joe's comments I have changed my mind.
Actually it is better to not expose "ledgerId" and "entryId" to client
applications.
They are useless pieces of information.
And also if in the future we want to change the way we internally
address a message we will always have to support these fields.

toByteArray() is enough for an application to save the ID into an
external database and then to recover a Subscription (or a Reader)
from a certain point.
toString() is good only for debug/logs, we can change it but it is
better to not touch it and add "tests"

It is better that the MessageId API stays as opaque as possible.

Enrico


Il giorno mer 9 nov 2022 alle ore 15:50 Yunze Xu
<yz...@streamnative.io.invalid> ha scritto:
>
> Hi Jiaqi,
>
> > I don't think `tostring` should be used in any serious case because it has
> no standard.
>
> I agree. But it's better to keep it not changed. Just like my previous reply, it
> might be a de-facto standard because the `toString()` like methods are used
> in logging, not only for debugging. For example, there is a getLastMessageId
> API in consumer and users might log the last message ID.
>
> Different representations across different languages is not a big issue but it
> could make users and administrators confused.
>
> Here is an example that the C++ client supports constructing a MessageId with
> 4 arguments, but the 1st argument is the partition, not the ledger id.
> However, the
> string representation is still
> "<ledger-id>:<entry-id>:<partition>:<batch-index>". Though
> in Java client a non-batched message ID doesn't have the
> ":<batch-index>" suffix.
>
> Thanks,
> Yunze
>
> On Wed, Nov 9, 2022 at 9:13 PM Jiaqi Shen <gl...@gmail.com> wrote:
> >
> > Thanks, this is very inspiring to me.
> >
> > But I have a different opinion on `tostring`.
> >
> > >>You can only see a representation from `toString` method and got some
> > output like "0:0:-1:0".
> >
> > I don't think `tostring` should be used in any serious case because it has
> > no standard. There is no constraint on how the messageId should be
> > converted to a string. For example, in go client, `tostring` is not being
> > supported now. If go client should implement a `tostring` method, does go
> > client' s`tostring` must follow the java implement like "0:0:-1:0"?
> >
> > If user do need a string/[]byte to record a messageId, `toByteArray` will
> > be enough. In user side, most of the time , I think users don't really care
> > about the "messageId string" is meaningful. I think `tostring` only should
> > be used in debug.
> >
> > Thanks,
> > Jiaqi Shen
> >
> >
> > Joe F <jo...@gmail.com> 于2022年11月9日周三 20:25写道:
> >
> > > Messageid is an identifier which identifies a message.  How that id is
> > > constructed, or what it contains should not  matter to an application,  and
> > > an application should not assume anything about the implementation of that
> > > id.
> > >
> > > >What about the partition index? We have a `TopicMetadata` interface that
> > > returns the number of partitions.
> > >
> > > Partitioning is a first class concept, and is  designed to be used by
> > > application.  How a partition is implemented  should not be used by the
> > > application .
> > >
> > >  [ People violate this all the time, and I regret that Pulsar did not
> > > provide get_Nth_topicpartion(), which led to people hardcoding it  as
> > > topicname-N. and using that directly.  Now we are stuck with it.]
> > >
> > >  Similarly batch index and batch size. Those are all logical concepts
> > > exposed to the user.  For eg: batch size is something the app is allowed to
> > > tune
> > >
> > > >Even for ledger id and entry id, this pair represents a logic storage
> > > position like the offset concept in Kafka
> > > These are not equivalent.   In Pulsar these are implementation details,
> > > while in Kafka those are logical concepts.
> > >
> > > One might think that these are logical concepts in Pulsar, because if you
> > > reverse engineer the current msgid implementation, you observe some
> > > "properties".
> > >
> > > Ledger id/entry id are logical concepts in __Bookkeeper__ , not  in Pulsar.
> > > There is the Managed Ledger abstraction on top of BK, and then there is
> > > Pulsar on top of ML. You will break two levels of abstraction to expose
> > > ledger/entryid to an application
> > >
> > > An application  should only care about the  operations that  can be done
> > > with a messageId
> > >
> > > - getmsgid() to return the message id  as an opaque object
> > >
> > > [Operators   using  one messageId ]
> > > -serde,   like tostring(). for storage/retrieval of message identifier
> > > -getter/setter on logical properties of the message (partition id etc...)
> > > -increment/decrement
> > >
> > > [Operators that take multiple messageIds]
> > > -comparator
> > > -range
> > >
> > > Those are the kind of operators Pulsar should provide to a user.
> > > Applications should not implement these operators on their own by reverse
> > > engineering the msgId. No application should be directly using ledgerid or
> > > entryid for doing anything (math or logic),
> > >
> > >   As long as Pulsar provides  these operations  with msgid to the
> > > application,  it should not care whether it's represented as "0:1:-1:-1"
> > > or  "a:b:-b-b", or   "#xba4231!haxcy1826923f" or as a serialized binary
> > > object or..whatever it may be.
> > >
> > > >>But it would be harder to know a tuple like "0:1:-1:-1" means.
> > >
> > > A user shouldn't have to know what this means. That's the point.
> > >
> > > Pulsar itself changed the messageId multiple times as it added
> > > partitioning, batching and so on, and it might do so again. And bookkeeper
> > > could change its representation of  ledgers, (for eg,  to uuids and byte
> > > offsets)  ML could replace BK with something else  (for eg.  a table in a
> > > db.)  Anything is possible - Pulsar would then just have to change the
> > > implementation of the operator functions, and no application needs to be
> > > rewritten.
> > >
> > > -j
> > >
> > > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> > > wrote:
> > >
> > > > Hi Joe,
> > > >
> > > > Then what would we expect users to do with the MessageId? It should only
> > > > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> > > >
> > > > What about the partition index? We have a `TopicMetadata` interface that
> > > > returns
> > > > the number of partitions. If the partition is also "implementation
> > > > details", should we expose
> > > > this interface? Or should we support customizing a MessageRouter because
> > > it
> > > > returns the partition index?
> > > >
> > > > What about the batch index and batch size? For example, we have an
> > > > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > > > index is also "implementation details", how could users know what does
> > > > "batch
> > > > index ack" mean?
> > > >
> > > > Even for ledger id and entry id, this pair represents a logic storage
> > > > position like the offset
> > > > concept in Kafka (though each offset represents a message while each
> > > > entry represents
> > > > a batch). If you see the Message API, it also exposes many attributes.
> > > > IMO, for the
> > > > MessageIdData, only the ack_set (a long array serialized from the
> > > > BitSet) is the implementation
> > > > detail.
> > > >
> > > > The MessageId API should be flexible, not an abstract one. If not, why
> > > > do we still implement
> > > > the toString() method? We should not encourage users to print the
> > > > MessageId. It would
> > > > be easy to know what "ledger is 0, entry id is 1" means, users only
> > > > need to know the concepts
> > > > of ledger id and entry id. But it would be harder to know a tuple like
> > > > "0:1:-1:-1" means.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > > > >
> > > > > >Maybe this design is to hidden some details, but if
> > > > > users don't know the details like ledger id and entry id, how could
> > > > > you know what does "0:0:-1:0" mean?
> > > > >
> > > > >  Abstractions exist for a reason. Ledgerid and entryid are
> > > > implementation
> > > > > details, and an application should not be interpreting that at all.
> > > > > -j
> > > > >
> > > > >
> > > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > > > > wrote:
> > > > >
> > > > > > I didn't look into these two methods at the moment. But I think it's
> > > > > > possible to
> > > > > > retain only the `fromByteArray`.
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > >
> > > > > > > > Hi Enrico,
> > > > > > > >
> > > > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > > > >
> > > > > > > > We already have the `toByteArray` method, right?
> > > > > > >
> > > > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > > > quickly.
> > > > > > >
> > > > > > > I am not sure if this can be in the scope of this initiative, but
> > > we
> > > > > > > should somehow get rid of
> > > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > > > >
> > > > > > > Thanks
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > > > > > >
> > > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > > > eolivelli@gmail.com>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Currently we have the following 5 implementations of
> > > MessageId:
> > > > > > > > > >
> > > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size,
> > > acker),
> > > > > > where
> > > > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > > > represents
> > > > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name
> > > to
> > > > the
> > > > > > > > > >     MessageId.
> > > > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > > > name
> > > > > > > > > >
> > > > > > > > > > These implementations are such a mess. For example, when
> > > users
> > > > get
> > > > > > a
> > > > > > > > > > MessageId from `Producer#send`:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > var id = producer.send("msg");
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > There is no getter to get some specific fields like ledger
> > > id.
> > > > You
> > > > > > can
> > > > > > > > > > only see a representation from `toString` method and got some
> > > > > > output
> > > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > > > but
> > > > > > if
> > > > > > > > > > users don't know the details like ledger id and entry id, how
> > > > could
> > > > > > > > > > you know what does "0:0:-1:0" mean? What if
> > > > `MessageId#toString`'s
> > > > > > > > > > implementation changed? Should it be treated as a breaking
> > > > change?
> > > > > > > > > >
> > > > > > > > > > The original definition of the underlying MessageIdData is
> > > much
> > > > > > more
> > > > > > > > > > clear:
> > > > > > > > > >
> > > > > > > > > > ```proto
> > > > > > > > > > message MessageIdData {
> > > > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > > > >     required uint64 entryId  = 2;
> > > > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > > > >     optional int32 batch_size = 6;
> > > > > > > > > >
> > > > > > > > > >     // For the chunk message id, we need to specify the first
> > > > > > chunk message id.
> > > > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > > > }
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's
> > > more
> > > > > > natural
> > > > > > > > > > to have an interface like:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > interface MessageId {
> > > > > > > > > >     long ledgerId();
> > > > > > > > > >     long entryId();
> > > > > > > > > >     Optional<Integer> partition();
> > > > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > > > >     // ...
> > > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > This is very good for client applications.
> > > > > > > > > We also need a way to represent this as a String or a byte[],
> > > > this
> > > > > > way
> > > > > > > > > client applications have a standard way to store
> > > > > > > > > message offsets into an external system (for instance when you
> > > > want
> > > > > > to
> > > > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > > > >
> > > > > > > > > Enrico
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Additionally, there are many places that use only the triple
> > > of
> > > > > > > > > > (ledger id, entry id, batch index) as the key to represent
> > > the
> > > > > > position.
> > > > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > > > intuitive
> > > > > > to
> > > > > > > > > > write something like:
> > > > > > > > > >
> > > > > > > > > > ```java
> > > > > > > > > > class MessageIdPosition implements
> > > > Comparable<MessageIdPosition> {
> > > > > > > > > >     private final MessageId messageId;
> > > > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > > > ```
> > > > > > > > > >
> > > > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > > > MessageId
> > > > > > > > > > interface only by adding some getters. Regarding the 5
> > > existing
> > > > > > > > > > implementations, I think we can drop them because they are a
> > > > part
> > > > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > > > >
> > > > > > > > > > Please feel free to share your points.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Yunze
> > > > > >
> > > >
> > >

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
Hi Jiaqi,

> I don't think `tostring` should be used in any serious case because it has
no standard.

I agree. But it's better to keep it not changed. Just like my previous reply, it
might be a de-facto standard because the `toString()` like methods are used
in logging, not only for debugging. For example, there is a getLastMessageId
API in consumer and users might log the last message ID.

Different representations across different languages is not a big issue but it
could make users and administrators confused.

Here is an example that the C++ client supports constructing a MessageId with
4 arguments, but the 1st argument is the partition, not the ledger id.
However, the
string representation is still
"<ledger-id>:<entry-id>:<partition>:<batch-index>". Though
in Java client a non-batched message ID doesn't have the
":<batch-index>" suffix.

Thanks,
Yunze

On Wed, Nov 9, 2022 at 9:13 PM Jiaqi Shen <gl...@gmail.com> wrote:
>
> Thanks, this is very inspiring to me.
>
> But I have a different opinion on `tostring`.
>
> >>You can only see a representation from `toString` method and got some
> output like "0:0:-1:0".
>
> I don't think `tostring` should be used in any serious case because it has
> no standard. There is no constraint on how the messageId should be
> converted to a string. For example, in go client, `tostring` is not being
> supported now. If go client should implement a `tostring` method, does go
> client' s`tostring` must follow the java implement like "0:0:-1:0"?
>
> If user do need a string/[]byte to record a messageId, `toByteArray` will
> be enough. In user side, most of the time , I think users don't really care
> about the "messageId string" is meaningful. I think `tostring` only should
> be used in debug.
>
> Thanks,
> Jiaqi Shen
>
>
> Joe F <jo...@gmail.com> 于2022年11月9日周三 20:25写道:
>
> > Messageid is an identifier which identifies a message.  How that id is
> > constructed, or what it contains should not  matter to an application,  and
> > an application should not assume anything about the implementation of that
> > id.
> >
> > >What about the partition index? We have a `TopicMetadata` interface that
> > returns the number of partitions.
> >
> > Partitioning is a first class concept, and is  designed to be used by
> > application.  How a partition is implemented  should not be used by the
> > application .
> >
> >  [ People violate this all the time, and I regret that Pulsar did not
> > provide get_Nth_topicpartion(), which led to people hardcoding it  as
> > topicname-N. and using that directly.  Now we are stuck with it.]
> >
> >  Similarly batch index and batch size. Those are all logical concepts
> > exposed to the user.  For eg: batch size is something the app is allowed to
> > tune
> >
> > >Even for ledger id and entry id, this pair represents a logic storage
> > position like the offset concept in Kafka
> > These are not equivalent.   In Pulsar these are implementation details,
> > while in Kafka those are logical concepts.
> >
> > One might think that these are logical concepts in Pulsar, because if you
> > reverse engineer the current msgid implementation, you observe some
> > "properties".
> >
> > Ledger id/entry id are logical concepts in __Bookkeeper__ , not  in Pulsar.
> > There is the Managed Ledger abstraction on top of BK, and then there is
> > Pulsar on top of ML. You will break two levels of abstraction to expose
> > ledger/entryid to an application
> >
> > An application  should only care about the  operations that  can be done
> > with a messageId
> >
> > - getmsgid() to return the message id  as an opaque object
> >
> > [Operators   using  one messageId ]
> > -serde,   like tostring(). for storage/retrieval of message identifier
> > -getter/setter on logical properties of the message (partition id etc...)
> > -increment/decrement
> >
> > [Operators that take multiple messageIds]
> > -comparator
> > -range
> >
> > Those are the kind of operators Pulsar should provide to a user.
> > Applications should not implement these operators on their own by reverse
> > engineering the msgId. No application should be directly using ledgerid or
> > entryid for doing anything (math or logic),
> >
> >   As long as Pulsar provides  these operations  with msgid to the
> > application,  it should not care whether it's represented as "0:1:-1:-1"
> > or  "a:b:-b-b", or   "#xba4231!haxcy1826923f" or as a serialized binary
> > object or..whatever it may be.
> >
> > >>But it would be harder to know a tuple like "0:1:-1:-1" means.
> >
> > A user shouldn't have to know what this means. That's the point.
> >
> > Pulsar itself changed the messageId multiple times as it added
> > partitioning, batching and so on, and it might do so again. And bookkeeper
> > could change its representation of  ledgers, (for eg,  to uuids and byte
> > offsets)  ML could replace BK with something else  (for eg.  a table in a
> > db.)  Anything is possible - Pulsar would then just have to change the
> > implementation of the operator functions, and no application needs to be
> > rewritten.
> >
> > -j
> >
> > On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> > wrote:
> >
> > > Hi Joe,
> > >
> > > Then what would we expect users to do with the MessageId? It should only
> > > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> > >
> > > What about the partition index? We have a `TopicMetadata` interface that
> > > returns
> > > the number of partitions. If the partition is also "implementation
> > > details", should we expose
> > > this interface? Or should we support customizing a MessageRouter because
> > it
> > > returns the partition index?
> > >
> > > What about the batch index and batch size? For example, we have an
> > > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > > index is also "implementation details", how could users know what does
> > > "batch
> > > index ack" mean?
> > >
> > > Even for ledger id and entry id, this pair represents a logic storage
> > > position like the offset
> > > concept in Kafka (though each offset represents a message while each
> > > entry represents
> > > a batch). If you see the Message API, it also exposes many attributes.
> > > IMO, for the
> > > MessageIdData, only the ack_set (a long array serialized from the
> > > BitSet) is the implementation
> > > detail.
> > >
> > > The MessageId API should be flexible, not an abstract one. If not, why
> > > do we still implement
> > > the toString() method? We should not encourage users to print the
> > > MessageId. It would
> > > be easy to know what "ledger is 0, entry id is 1" means, users only
> > > need to know the concepts
> > > of ledger id and entry id. But it would be harder to know a tuple like
> > > "0:1:-1:-1" means.
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > > >
> > > > >Maybe this design is to hidden some details, but if
> > > > users don't know the details like ledger id and entry id, how could
> > > > you know what does "0:0:-1:0" mean?
> > > >
> > > >  Abstractions exist for a reason. Ledgerid and entryid are
> > > implementation
> > > > details, and an application should not be interpreting that at all.
> > > > -j
> > > >
> > > >
> > > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > > > wrote:
> > > >
> > > > > I didn't look into these two methods at the moment. But I think it's
> > > > > possible to
> > > > > retain only the `fromByteArray`.
> > > > >
> > > > > Thanks,
> > > > > Yunze
> > > > >
> > > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > >
> > > > > > > Hi Enrico,
> > > > > > >
> > > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > > >
> > > > > > > We already have the `toByteArray` method, right?
> > > > > >
> > > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > > quickly.
> > > > > >
> > > > > > I am not sure if this can be in the scope of this initiative, but
> > we
> > > > > > should somehow get rid of
> > > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > > >
> > > > > > Thanks
> > > > > > Enrico
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yunze
> > > > > > >
> > > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > > eolivelli@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > > >
> > > > > > > > > Hi all,
> > > > > > > > >
> > > > > > > > > Currently we have the following 5 implementations of
> > MessageId:
> > > > > > > > >
> > > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size,
> > acker),
> > > > > where
> > > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > > represents
> > > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name
> > to
> > > the
> > > > > > > > >     MessageId.
> > > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > > name
> > > > > > > > >
> > > > > > > > > These implementations are such a mess. For example, when
> > users
> > > get
> > > > > a
> > > > > > > > > MessageId from `Producer#send`:
> > > > > > > > >
> > > > > > > > > ```java
> > > > > > > > > var id = producer.send("msg");
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > There is no getter to get some specific fields like ledger
> > id.
> > > You
> > > > > can
> > > > > > > > > only see a representation from `toString` method and got some
> > > > > output
> > > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > > but
> > > > > if
> > > > > > > > > users don't know the details like ledger id and entry id, how
> > > could
> > > > > > > > > you know what does "0:0:-1:0" mean? What if
> > > `MessageId#toString`'s
> > > > > > > > > implementation changed? Should it be treated as a breaking
> > > change?
> > > > > > > > >
> > > > > > > > > The original definition of the underlying MessageIdData is
> > much
> > > > > more
> > > > > > > > > clear:
> > > > > > > > >
> > > > > > > > > ```proto
> > > > > > > > > message MessageIdData {
> > > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > > >     required uint64 entryId  = 2;
> > > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > > >     optional int32 batch_size = 6;
> > > > > > > > >
> > > > > > > > >     // For the chunk message id, we need to specify the first
> > > > > chunk message id.
> > > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > > }
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's
> > more
> > > > > natural
> > > > > > > > > to have an interface like:
> > > > > > > > >
> > > > > > > > > ```java
> > > > > > > > > interface MessageId {
> > > > > > > > >     long ledgerId();
> > > > > > > > >     long entryId();
> > > > > > > > >     Optional<Integer> partition();
> > > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > > >     // ...
> > > > > > > > > ```
> > > > > > > >
> > > > > > > > This is very good for client applications.
> > > > > > > > We also need a way to represent this as a String or a byte[],
> > > this
> > > > > way
> > > > > > > > client applications have a standard way to store
> > > > > > > > message offsets into an external system (for instance when you
> > > want
> > > > > to
> > > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > > >
> > > > > > > > Enrico
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Additionally, there are many places that use only the triple
> > of
> > > > > > > > > (ledger id, entry id, batch index) as the key to represent
> > the
> > > > > position.
> > > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > > intuitive
> > > > > to
> > > > > > > > > write something like:
> > > > > > > > >
> > > > > > > > > ```java
> > > > > > > > > class MessageIdPosition implements
> > > Comparable<MessageIdPosition> {
> > > > > > > > >     private final MessageId messageId;
> > > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > > MessageId
> > > > > > > > > interface only by adding some getters. Regarding the 5
> > existing
> > > > > > > > > implementations, I think we can drop them because they are a
> > > part
> > > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > > >
> > > > > > > > > Please feel free to share your points.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Yunze
> > > > >
> > >
> >

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Jiaqi Shen <gl...@gmail.com>.
Thanks, this is very inspiring to me.

But I have a different opinion on `tostring`.

>>You can only see a representation from `toString` method and got some
output like "0:0:-1:0".

I don't think `tostring` should be used in any serious case because it has
no standard. There is no constraint on how the messageId should be
converted to a string. For example, in go client, `tostring` is not being
supported now. If go client should implement a `tostring` method, does go
client' s`tostring` must follow the java implement like "0:0:-1:0"?

If user do need a string/[]byte to record a messageId, `toByteArray` will
be enough. In user side, most of the time , I think users don't really care
about the "messageId string" is meaningful. I think `tostring` only should
be used in debug.

Thanks,
Jiaqi Shen


Joe F <jo...@gmail.com> 于2022年11月9日周三 20:25写道:

> Messageid is an identifier which identifies a message.  How that id is
> constructed, or what it contains should not  matter to an application,  and
> an application should not assume anything about the implementation of that
> id.
>
> >What about the partition index? We have a `TopicMetadata` interface that
> returns the number of partitions.
>
> Partitioning is a first class concept, and is  designed to be used by
> application.  How a partition is implemented  should not be used by the
> application .
>
>  [ People violate this all the time, and I regret that Pulsar did not
> provide get_Nth_topicpartion(), which led to people hardcoding it  as
> topicname-N. and using that directly.  Now we are stuck with it.]
>
>  Similarly batch index and batch size. Those are all logical concepts
> exposed to the user.  For eg: batch size is something the app is allowed to
> tune
>
> >Even for ledger id and entry id, this pair represents a logic storage
> position like the offset concept in Kafka
> These are not equivalent.   In Pulsar these are implementation details,
> while in Kafka those are logical concepts.
>
> One might think that these are logical concepts in Pulsar, because if you
> reverse engineer the current msgid implementation, you observe some
> "properties".
>
> Ledger id/entry id are logical concepts in __Bookkeeper__ , not  in Pulsar.
> There is the Managed Ledger abstraction on top of BK, and then there is
> Pulsar on top of ML. You will break two levels of abstraction to expose
> ledger/entryid to an application
>
> An application  should only care about the  operations that  can be done
> with a messageId
>
> - getmsgid() to return the message id  as an opaque object
>
> [Operators   using  one messageId ]
> -serde,   like tostring(). for storage/retrieval of message identifier
> -getter/setter on logical properties of the message (partition id etc...)
> -increment/decrement
>
> [Operators that take multiple messageIds]
> -comparator
> -range
>
> Those are the kind of operators Pulsar should provide to a user.
> Applications should not implement these operators on their own by reverse
> engineering the msgId. No application should be directly using ledgerid or
> entryid for doing anything (math or logic),
>
>   As long as Pulsar provides  these operations  with msgid to the
> application,  it should not care whether it's represented as "0:1:-1:-1"
> or  "a:b:-b-b", or   "#xba4231!haxcy1826923f" or as a serialized binary
> object or..whatever it may be.
>
> >>But it would be harder to know a tuple like "0:1:-1:-1" means.
>
> A user shouldn't have to know what this means. That's the point.
>
> Pulsar itself changed the messageId multiple times as it added
> partitioning, batching and so on, and it might do so again. And bookkeeper
> could change its representation of  ledgers, (for eg,  to uuids and byte
> offsets)  ML could replace BK with something else  (for eg.  a table in a
> db.)  Anything is possible - Pulsar would then just have to change the
> implementation of the operator functions, and no application needs to be
> rewritten.
>
> -j
>
> On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> wrote:
>
> > Hi Joe,
> >
> > Then what would we expect users to do with the MessageId? It should only
> > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> >
> > What about the partition index? We have a `TopicMetadata` interface that
> > returns
> > the number of partitions. If the partition is also "implementation
> > details", should we expose
> > this interface? Or should we support customizing a MessageRouter because
> it
> > returns the partition index?
> >
> > What about the batch index and batch size? For example, we have an
> > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > index is also "implementation details", how could users know what does
> > "batch
> > index ack" mean?
> >
> > Even for ledger id and entry id, this pair represents a logic storage
> > position like the offset
> > concept in Kafka (though each offset represents a message while each
> > entry represents
> > a batch). If you see the Message API, it also exposes many attributes.
> > IMO, for the
> > MessageIdData, only the ack_set (a long array serialized from the
> > BitSet) is the implementation
> > detail.
> >
> > The MessageId API should be flexible, not an abstract one. If not, why
> > do we still implement
> > the toString() method? We should not encourage users to print the
> > MessageId. It would
> > be easy to know what "ledger is 0, entry id is 1" means, users only
> > need to know the concepts
> > of ledger id and entry id. But it would be harder to know a tuple like
> > "0:1:-1:-1" means.
> >
> > Thanks,
> > Yunze
> >
> > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > >
> > > >Maybe this design is to hidden some details, but if
> > > users don't know the details like ledger id and entry id, how could
> > > you know what does "0:0:-1:0" mean?
> > >
> > >  Abstractions exist for a reason. Ledgerid and entryid are
> > implementation
> > > details, and an application should not be interpreting that at all.
> > > -j
> > >
> > >
> > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > > wrote:
> > >
> > > > I didn't look into these two methods at the moment. But I think it's
> > > > possible to
> > > > retain only the `fromByteArray`.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > >
> > > > > > Hi Enrico,
> > > > > >
> > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > >
> > > > > > We already have the `toByteArray` method, right?
> > > > >
> > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > quickly.
> > > > >
> > > > > I am not sure if this can be in the scope of this initiative, but
> we
> > > > > should somehow get rid of
> > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > >
> > > > > Thanks
> > > > > Enrico
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > eolivelli@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Currently we have the following 5 implementations of
> MessageId:
> > > > > > > >
> > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size,
> acker),
> > > > where
> > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > represents
> > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name
> to
> > the
> > > > > > > >     MessageId.
> > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > name
> > > > > > > >
> > > > > > > > These implementations are such a mess. For example, when
> users
> > get
> > > > a
> > > > > > > > MessageId from `Producer#send`:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > var id = producer.send("msg");
> > > > > > > > ```
> > > > > > > >
> > > > > > > > There is no getter to get some specific fields like ledger
> id.
> > You
> > > > can
> > > > > > > > only see a representation from `toString` method and got some
> > > > output
> > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > but
> > > > if
> > > > > > > > users don't know the details like ledger id and entry id, how
> > could
> > > > > > > > you know what does "0:0:-1:0" mean? What if
> > `MessageId#toString`'s
> > > > > > > > implementation changed? Should it be treated as a breaking
> > change?
> > > > > > > >
> > > > > > > > The original definition of the underlying MessageIdData is
> much
> > > > more
> > > > > > > > clear:
> > > > > > > >
> > > > > > > > ```proto
> > > > > > > > message MessageIdData {
> > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > >     required uint64 entryId  = 2;
> > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > >     optional int32 batch_size = 6;
> > > > > > > >
> > > > > > > >     // For the chunk message id, we need to specify the first
> > > > chunk message id.
> > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's
> more
> > > > natural
> > > > > > > > to have an interface like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > interface MessageId {
> > > > > > > >     long ledgerId();
> > > > > > > >     long entryId();
> > > > > > > >     Optional<Integer> partition();
> > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > >     // ...
> > > > > > > > ```
> > > > > > >
> > > > > > > This is very good for client applications.
> > > > > > > We also need a way to represent this as a String or a byte[],
> > this
> > > > way
> > > > > > > client applications have a standard way to store
> > > > > > > message offsets into an external system (for instance when you
> > want
> > > > to
> > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > >
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Additionally, there are many places that use only the triple
> of
> > > > > > > > (ledger id, entry id, batch index) as the key to represent
> the
> > > > position.
> > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > intuitive
> > > > to
> > > > > > > > write something like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > class MessageIdPosition implements
> > Comparable<MessageIdPosition> {
> > > > > > > >     private final MessageId messageId;
> > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > ```
> > > > > > > >
> > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > MessageId
> > > > > > > > interface only by adding some getters. Regarding the 5
> existing
> > > > > > > > implementations, I think we can drop them because they are a
> > part
> > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > >
> > > > > > > > Please feel free to share your points.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > >
> >
>

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
Hi Joe,

I think the most controversial point is what should a MessageId be used for.
From your opinion, it should only be used as a comparable object (opaque),
which represents the position of a message [1]. What I have thought is,
MessageId should be a wrapper of the MessageIdData in PulsarApi.proto [2].

I agree that at application side, there are not many cases that need
to look into
the details of a specific MessageId implementation. However, these
"internal fields"
are the de-facto concepts in Pulsar. Yeah, ledger id and entry id are
concepts from
BK and they might change. I doubt if there is any Pulsar application
developer that
doesn't know these two concepts?

Let's look at the clients of other languages, including C++[3],
Golang[4], Python[5],
Rust[6][7], C#[8]. Only the official Java client doesn't expose these fields.

I'm also okay to not change the MessageId interface in the
pulsar-client-api module.
Because the motivation is to simplify the messed implementations of
MessageId in the
pulsar-client module.

We can add a MessageIdPojo class in the pulsar-common module or else.
In the Pulsar
repo, all MessageId implementations must extend the MessageIdPojo class.

```java
@Data
class MessageIdPojo extends MessageId {
    private final long ledgerId;
    // ...
}
```

[1] https://pulsar.apache.org/docs/concepts-messaging#messages
[2] https://github.com/apache/pulsar/blob/b31c5a6a325728b5dc5faebd1a33386952d733d5/pulsar-common/src/main/proto/PulsarApi.proto#L57
[3] https://github.com/apache/pulsar-client-cpp/blob/main/include/pulsar/MessageId.h
[4] https://github.com/apache/pulsar-client-go/blob/d9c8b0ab9c14e8d571b632c93002ea20db1a2c16/pulsar/message.go#L147
[5] https://github.com/apache/pulsar-client-python/blob/75a57b427d4c6944c49f4b712344107b5444aa36/pulsar/__init__.py#L84
[6] https://github.com/streamnative/pulsar-rs/blob/de59974080daa248bfdeaea8510eb72ec8f30bac/src/consumer.rs#L1984
[7] https://github.com/streamnative/pulsar-rs/blob/de59974080daa248bfdeaea8510eb72ec8f30bac/src/consumer.rs#L1353
[8] https://github.com/apache/pulsar-dotpulsar/blob/0590b1ad6c4474d425662352ba62abb41bfb9f0a/src/DotPulsar/MessageId.cs#L56

Thanks,
Yunze

On Wed, Nov 9, 2022 at 8:24 PM Joe F <jo...@gmail.com> wrote:
>
> Messageid is an identifier which identifies a message.  How that id is
> constructed, or what it contains should not  matter to an application,  and
> an application should not assume anything about the implementation of that
> id.
>
> >What about the partition index? We have a `TopicMetadata` interface that
> returns the number of partitions.
>
> Partitioning is a first class concept, and is  designed to be used by
> application.  How a partition is implemented  should not be used by the
> application .
>
>  [ People violate this all the time, and I regret that Pulsar did not
> provide get_Nth_topicpartion(), which led to people hardcoding it  as
> topicname-N. and using that directly.  Now we are stuck with it.]
>
>  Similarly batch index and batch size. Those are all logical concepts
> exposed to the user.  For eg: batch size is something the app is allowed to
> tune
>
> >Even for ledger id and entry id, this pair represents a logic storage
> position like the offset concept in Kafka
> These are not equivalent.   In Pulsar these are implementation details,
> while in Kafka those are logical concepts.
>
> One might think that these are logical concepts in Pulsar, because if you
> reverse engineer the current msgid implementation, you observe some
> "properties".
>
> Ledger id/entry id are logical concepts in __Bookkeeper__ , not  in Pulsar.
> There is the Managed Ledger abstraction on top of BK, and then there is
> Pulsar on top of ML. You will break two levels of abstraction to expose
> ledger/entryid to an application
>
> An application  should only care about the  operations that  can be done
> with a messageId
>
> - getmsgid() to return the message id  as an opaque object
>
> [Operators   using  one messageId ]
> -serde,   like tostring(). for storage/retrieval of message identifier
> -getter/setter on logical properties of the message (partition id etc...)
> -increment/decrement
>
> [Operators that take multiple messageIds]
> -comparator
> -range
>
> Those are the kind of operators Pulsar should provide to a user.
> Applications should not implement these operators on their own by reverse
> engineering the msgId. No application should be directly using ledgerid or
> entryid for doing anything (math or logic),
>
>   As long as Pulsar provides  these operations  with msgid to the
> application,  it should not care whether it's represented as "0:1:-1:-1"
> or  "a:b:-b-b", or   "#xba4231!haxcy1826923f" or as a serialized binary
> object or..whatever it may be.
>
> >>But it would be harder to know a tuple like "0:1:-1:-1" means.
>
> A user shouldn't have to know what this means. That's the point.
>
> Pulsar itself changed the messageId multiple times as it added
> partitioning, batching and so on, and it might do so again. And bookkeeper
> could change its representation of  ledgers, (for eg,  to uuids and byte
> offsets)  ML could replace BK with something else  (for eg.  a table in a
> db.)  Anything is possible - Pulsar would then just have to change the
> implementation of the operator functions, and no application needs to be
> rewritten.
>
> -j
>
> On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
> wrote:
>
> > Hi Joe,
> >
> > Then what would we expect users to do with the MessageId? It should only
> > be passed to Consumer#seek or ReaderBuilder#startMessageId?
> >
> > What about the partition index? We have a `TopicMetadata` interface that
> > returns
> > the number of partitions. If the partition is also "implementation
> > details", should we expose
> > this interface? Or should we support customizing a MessageRouter because it
> > returns the partition index?
> >
> > What about the batch index and batch size? For example, we have an
> > enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> > index is also "implementation details", how could users know what does
> > "batch
> > index ack" mean?
> >
> > Even for ledger id and entry id, this pair represents a logic storage
> > position like the offset
> > concept in Kafka (though each offset represents a message while each
> > entry represents
> > a batch). If you see the Message API, it also exposes many attributes.
> > IMO, for the
> > MessageIdData, only the ack_set (a long array serialized from the
> > BitSet) is the implementation
> > detail.
> >
> > The MessageId API should be flexible, not an abstract one. If not, why
> > do we still implement
> > the toString() method? We should not encourage users to print the
> > MessageId. It would
> > be easy to know what "ledger is 0, entry id is 1" means, users only
> > need to know the concepts
> > of ledger id and entry id. But it would be harder to know a tuple like
> > "0:1:-1:-1" means.
> >
> > Thanks,
> > Yunze
> >
> > On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> > >
> > > >Maybe this design is to hidden some details, but if
> > > users don't know the details like ledger id and entry id, how could
> > > you know what does "0:0:-1:0" mean?
> > >
> > >  Abstractions exist for a reason. Ledgerid and entryid are
> > implementation
> > > details, and an application should not be interpreting that at all.
> > > -j
> > >
> > >
> > > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > > wrote:
> > >
> > > > I didn't look into these two methods at the moment. But I think it's
> > > > possible to
> > > > retain only the `fromByteArray`.
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > >
> > > > > > Hi Enrico,
> > > > > >
> > > > > > > We also need a way to represent this as a String or a byte[]
> > > > > >
> > > > > > We already have the `toByteArray` method, right?
> > > > >
> > > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > > quickly.
> > > > >
> > > > > I am not sure if this can be in the scope of this initiative, but we
> > > > > should somehow get rid of
> > > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > > >
> > > > > Thanks
> > > > > Enrico
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> > > > > >
> > > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> > eolivelli@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > > >
> > > > > > > > Hi all,
> > > > > > > >
> > > > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > > > >
> > > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > > > where
> > > > > > > >     acker is a wrapper of a BitSet.
> > > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> > represents
> > > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to
> > the
> > > > > > > >     MessageId.
> > > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> > name
> > > > > > > >
> > > > > > > > These implementations are such a mess. For example, when users
> > get
> > > > a
> > > > > > > > MessageId from `Producer#send`:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > var id = producer.send("msg");
> > > > > > > > ```
> > > > > > > >
> > > > > > > > There is no getter to get some specific fields like ledger id.
> > You
> > > > can
> > > > > > > > only see a representation from `toString` method and got some
> > > > output
> > > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> > but
> > > > if
> > > > > > > > users don't know the details like ledger id and entry id, how
> > could
> > > > > > > > you know what does "0:0:-1:0" mean? What if
> > `MessageId#toString`'s
> > > > > > > > implementation changed? Should it be treated as a breaking
> > change?
> > > > > > > >
> > > > > > > > The original definition of the underlying MessageIdData is much
> > > > more
> > > > > > > > clear:
> > > > > > > >
> > > > > > > > ```proto
> > > > > > > > message MessageIdData {
> > > > > > > >     required uint64 ledgerId = 1;
> > > > > > > >     required uint64 entryId  = 2;
> > > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > > >     repeated int64 ack_set = 5;
> > > > > > > >     optional int32 batch_size = 6;
> > > > > > > >
> > > > > > > >     // For the chunk message id, we need to specify the first
> > > > chunk message id.
> > > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > > }
> > > > > > > > ```
> > > > > > > >
> > > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > > > natural
> > > > > > > > to have an interface like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > interface MessageId {
> > > > > > > >     long ledgerId();
> > > > > > > >     long entryId();
> > > > > > > >     Optional<Integer> partition();
> > > > > > > >     Optional<Integer> batchIndex();
> > > > > > > >     // ...
> > > > > > > > ```
> > > > > > >
> > > > > > > This is very good for client applications.
> > > > > > > We also need a way to represent this as a String or a byte[],
> > this
> > > > way
> > > > > > > client applications have a standard way to store
> > > > > > > message offsets into an external system (for instance when you
> > want
> > > > to
> > > > > > > user the Reader API and keep track of the position by yourself)
> > > > > > >
> > > > > > > Enrico
> > > > > > >
> > > > > > > >
> > > > > > > > Additionally, there are many places that use only the triple of
> > > > > > > > (ledger id, entry id, batch index) as the key to represent the
> > > > position.
> > > > > > > > Currently, they are done by adding a conversion from
> > > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> > intuitive
> > > > to
> > > > > > > > write something like:
> > > > > > > >
> > > > > > > > ```java
> > > > > > > > class MessageIdPosition implements
> > Comparable<MessageIdPosition> {
> > > > > > > >     private final MessageId messageId;
> > > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > > ```
> > > > > > > >
> > > > > > > > Therefore, I'm going to write a proposal to redesign the
> > MessageId
> > > > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > > > implementations, I think we can drop them because they are a
> > part
> > > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > > >
> > > > > > > > Please feel free to share your points.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yunze
> > > >
> >

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Joe F <jo...@gmail.com>.
Messageid is an identifier which identifies a message.  How that id is
constructed, or what it contains should not  matter to an application,  and
an application should not assume anything about the implementation of that
id.

>What about the partition index? We have a `TopicMetadata` interface that
returns the number of partitions.

Partitioning is a first class concept, and is  designed to be used by
application.  How a partition is implemented  should not be used by the
application .

 [ People violate this all the time, and I regret that Pulsar did not
provide get_Nth_topicpartion(), which led to people hardcoding it  as
topicname-N. and using that directly.  Now we are stuck with it.]

 Similarly batch index and batch size. Those are all logical concepts
exposed to the user.  For eg: batch size is something the app is allowed to
tune

>Even for ledger id and entry id, this pair represents a logic storage
position like the offset concept in Kafka
These are not equivalent.   In Pulsar these are implementation details,
while in Kafka those are logical concepts.

One might think that these are logical concepts in Pulsar, because if you
reverse engineer the current msgid implementation, you observe some
"properties".

Ledger id/entry id are logical concepts in __Bookkeeper__ , not  in Pulsar.
There is the Managed Ledger abstraction on top of BK, and then there is
Pulsar on top of ML. You will break two levels of abstraction to expose
ledger/entryid to an application

An application  should only care about the  operations that  can be done
with a messageId

- getmsgid() to return the message id  as an opaque object

[Operators   using  one messageId ]
-serde,   like tostring(). for storage/retrieval of message identifier
-getter/setter on logical properties of the message (partition id etc...)
-increment/decrement

[Operators that take multiple messageIds]
-comparator
-range

Those are the kind of operators Pulsar should provide to a user.
Applications should not implement these operators on their own by reverse
engineering the msgId. No application should be directly using ledgerid or
entryid for doing anything (math or logic),

  As long as Pulsar provides  these operations  with msgid to the
application,  it should not care whether it's represented as "0:1:-1:-1"
or  "a:b:-b-b", or   "#xba4231!haxcy1826923f" or as a serialized binary
object or..whatever it may be.

>>But it would be harder to know a tuple like "0:1:-1:-1" means.

A user shouldn't have to know what this means. That's the point.

Pulsar itself changed the messageId multiple times as it added
partitioning, batching and so on, and it might do so again. And bookkeeper
could change its representation of  ledgers, (for eg,  to uuids and byte
offsets)  ML could replace BK with something else  (for eg.  a table in a
db.)  Anything is possible - Pulsar would then just have to change the
implementation of the operator functions, and no application needs to be
rewritten.

-j

On Tue, Nov 8, 2022 at 6:05 PM Yunze Xu <yz...@streamnative.io.invalid>
wrote:

> Hi Joe,
>
> Then what would we expect users to do with the MessageId? It should only
> be passed to Consumer#seek or ReaderBuilder#startMessageId?
>
> What about the partition index? We have a `TopicMetadata` interface that
> returns
> the number of partitions. If the partition is also "implementation
> details", should we expose
> this interface? Or should we support customizing a MessageRouter because it
> returns the partition index?
>
> What about the batch index and batch size? For example, we have an
> enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
> index is also "implementation details", how could users know what does
> "batch
> index ack" mean?
>
> Even for ledger id and entry id, this pair represents a logic storage
> position like the offset
> concept in Kafka (though each offset represents a message while each
> entry represents
> a batch). If you see the Message API, it also exposes many attributes.
> IMO, for the
> MessageIdData, only the ack_set (a long array serialized from the
> BitSet) is the implementation
> detail.
>
> The MessageId API should be flexible, not an abstract one. If not, why
> do we still implement
> the toString() method? We should not encourage users to print the
> MessageId. It would
> be easy to know what "ledger is 0, entry id is 1" means, users only
> need to know the concepts
> of ledger id and entry id. But it would be harder to know a tuple like
> "0:1:-1:-1" means.
>
> Thanks,
> Yunze
>
> On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
> >
> > >Maybe this design is to hidden some details, but if
> > users don't know the details like ledger id and entry id, how could
> > you know what does "0:0:-1:0" mean?
> >
> >  Abstractions exist for a reason. Ledgerid and entryid are
> implementation
> > details, and an application should not be interpreting that at all.
> > -j
> >
> >
> > On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> > wrote:
> >
> > > I didn't look into these two methods at the moment. But I think it's
> > > possible to
> > > retain only the `fromByteArray`.
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > > >
> > > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > >
> > > > > Hi Enrico,
> > > > >
> > > > > > We also need a way to represent this as a String or a byte[]
> > > > >
> > > > > We already have the `toByteArray` method, right?
> > > >
> > > > Yes, correct. So we are fine. I forgot about it and I answered too
> > > quickly.
> > > >
> > > > I am not sure if this can be in the scope of this initiative, but we
> > > > should somehow get rid of
> > > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > > >
> > > > Thanks
> > > > Enrico
> > > >
> > > > >
> > > > > Thanks,
> > > > > Yunze
> > > > >
> > > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <
> eolivelli@gmail.com>
> > > wrote:
> > > > > >
> > > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > > >
> > > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > > where
> > > > > > >     acker is a wrapper of a BitSet.
> > > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that
> represents
> > > > > > >     the first MessageIdImpl of a BitSet.
> > > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to
> the
> > > > > > >     MessageId.
> > > > > > > - TopicMessageIdImpl: adds the topic name and the partition
> name
> > > > > > >
> > > > > > > These implementations are such a mess. For example, when users
> get
> > > a
> > > > > > > MessageId from `Producer#send`:
> > > > > > >
> > > > > > > ```java
> > > > > > > var id = producer.send("msg");
> > > > > > > ```
> > > > > > >
> > > > > > > There is no getter to get some specific fields like ledger id.
> You
> > > can
> > > > > > > only see a representation from `toString` method and got some
> > > output
> > > > > > > like "0:0:-1:0". Maybe this design is to hidden some details,
> but
> > > if
> > > > > > > users don't know the details like ledger id and entry id, how
> could
> > > > > > > you know what does "0:0:-1:0" mean? What if
> `MessageId#toString`'s
> > > > > > > implementation changed? Should it be treated as a breaking
> change?
> > > > > > >
> > > > > > > The original definition of the underlying MessageIdData is much
> > > more
> > > > > > > clear:
> > > > > > >
> > > > > > > ```proto
> > > > > > > message MessageIdData {
> > > > > > >     required uint64 ledgerId = 1;
> > > > > > >     required uint64 entryId  = 2;
> > > > > > >     optional int32 partition = 3 [default = -1];
> > > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > > >     repeated int64 ack_set = 5;
> > > > > > >     optional int32 batch_size = 6;
> > > > > > >
> > > > > > >     // For the chunk message id, we need to specify the first
> > > chunk message id.
> > > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > > }
> > > > > > > ```
> > > > > > >
> > > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > > natural
> > > > > > > to have an interface like:
> > > > > > >
> > > > > > > ```java
> > > > > > > interface MessageId {
> > > > > > >     long ledgerId();
> > > > > > >     long entryId();
> > > > > > >     Optional<Integer> partition();
> > > > > > >     Optional<Integer> batchIndex();
> > > > > > >     // ...
> > > > > > > ```
> > > > > >
> > > > > > This is very good for client applications.
> > > > > > We also need a way to represent this as a String or a byte[],
> this
> > > way
> > > > > > client applications have a standard way to store
> > > > > > message offsets into an external system (for instance when you
> want
> > > to
> > > > > > user the Reader API and keep track of the position by yourself)
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > > >
> > > > > > > Additionally, there are many places that use only the triple of
> > > > > > > (ledger id, entry id, batch index) as the key to represent the
> > > position.
> > > > > > > Currently, they are done by adding a conversion from
> > > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more
> intuitive
> > > to
> > > > > > > write something like:
> > > > > > >
> > > > > > > ```java
> > > > > > > class MessageIdPosition implements
> Comparable<MessageIdPosition> {
> > > > > > >     private final MessageId messageId;
> > > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > > ```
> > > > > > >
> > > > > > > Therefore, I'm going to write a proposal to redesign the
> MessageId
> > > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > > implementations, I think we can drop them because they are a
> part
> > > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > > >
> > > > > > > Please feel free to share your points.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yunze
> > >
>

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
Hi Joe,

Then what would we expect users to do with the MessageId? It should only
be passed to Consumer#seek or ReaderBuilder#startMessageId?

What about the partition index? We have a `TopicMetadata` interface that returns
the number of partitions. If the partition is also "implementation
details", should we expose
this interface? Or should we support customizing a MessageRouter because it
returns the partition index?

What about the batch index and batch size? For example, we have an
enableBatchIndexAcknowledgment method to enable batch index ACK. If batch
index is also "implementation details", how could users know what does "batch
index ack" mean?

Even for ledger id and entry id, this pair represents a logic storage
position like the offset
concept in Kafka (though each offset represents a message while each
entry represents
a batch). If you see the Message API, it also exposes many attributes.
IMO, for the
MessageIdData, only the ack_set (a long array serialized from the
BitSet) is the implementation
detail.

The MessageId API should be flexible, not an abstract one. If not, why
do we still implement
the toString() method? We should not encourage users to print the
MessageId. It would
be easy to know what "ledger is 0, entry id is 1" means, users only
need to know the concepts
of ledger id and entry id. But it would be harder to know a tuple like
"0:1:-1:-1" means.

Thanks,
Yunze

On Tue, Nov 8, 2022 at 11:16 PM Joe F <jo...@gmail.com> wrote:
>
> >Maybe this design is to hidden some details, but if
> users don't know the details like ledger id and entry id, how could
> you know what does "0:0:-1:0" mean?
>
>  Abstractions exist for a reason. Ledgerid and entryid are  implementation
> details, and an application should not be interpreting that at all.
> -j
>
>
> On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
> wrote:
>
> > I didn't look into these two methods at the moment. But I think it's
> > possible to
> > retain only the `fromByteArray`.
> >
> > Thanks,
> > Yunze
> >
> > On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> > wrote:
> > >
> > > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > > <yz...@streamnative.io.invalid> ha scritto:
> > > >
> > > > Hi Enrico,
> > > >
> > > > > We also need a way to represent this as a String or a byte[]
> > > >
> > > > We already have the `toByteArray` method, right?
> > >
> > > Yes, correct. So we are fine. I forgot about it and I answered too
> > quickly.
> > >
> > > I am not sure if this can be in the scope of this initiative, but we
> > > should somehow get rid of
> > > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> > >
> > > Thanks
> > > Enrico
> > >
> > > >
> > > > Thanks,
> > > > Yunze
> > > >
> > > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <eo...@gmail.com>
> > wrote:
> > > > >
> > > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > Currently we have the following 5 implementations of MessageId:
> > > > > >
> > > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> > where
> > > > > >     acker is a wrapper of a BitSet.
> > > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
> > > > > >     the first MessageIdImpl of a BitSet.
> > > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to the
> > > > > >     MessageId.
> > > > > > - TopicMessageIdImpl: adds the topic name and the partition name
> > > > > >
> > > > > > These implementations are such a mess. For example, when users get
> > a
> > > > > > MessageId from `Producer#send`:
> > > > > >
> > > > > > ```java
> > > > > > var id = producer.send("msg");
> > > > > > ```
> > > > > >
> > > > > > There is no getter to get some specific fields like ledger id. You
> > can
> > > > > > only see a representation from `toString` method and got some
> > output
> > > > > > like "0:0:-1:0". Maybe this design is to hidden some details, but
> > if
> > > > > > users don't know the details like ledger id and entry id, how could
> > > > > > you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> > > > > > implementation changed? Should it be treated as a breaking change?
> > > > > >
> > > > > > The original definition of the underlying MessageIdData is much
> > more
> > > > > > clear:
> > > > > >
> > > > > > ```proto
> > > > > > message MessageIdData {
> > > > > >     required uint64 ledgerId = 1;
> > > > > >     required uint64 entryId  = 2;
> > > > > >     optional int32 partition = 3 [default = -1];
> > > > > >     optional int32 batch_index = 4 [default = -1];
> > > > > >     repeated int64 ack_set = 5;
> > > > > >     optional int32 batch_size = 6;
> > > > > >
> > > > > >     // For the chunk message id, we need to specify the first
> > chunk message id.
> > > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > > }
> > > > > > ```
> > > > > >
> > > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> > natural
> > > > > > to have an interface like:
> > > > > >
> > > > > > ```java
> > > > > > interface MessageId {
> > > > > >     long ledgerId();
> > > > > >     long entryId();
> > > > > >     Optional<Integer> partition();
> > > > > >     Optional<Integer> batchIndex();
> > > > > >     // ...
> > > > > > ```
> > > > >
> > > > > This is very good for client applications.
> > > > > We also need a way to represent this as a String or a byte[], this
> > way
> > > > > client applications have a standard way to store
> > > > > message offsets into an external system (for instance when you want
> > to
> > > > > user the Reader API and keep track of the position by yourself)
> > > > >
> > > > > Enrico
> > > > >
> > > > > >
> > > > > > Additionally, there are many places that use only the triple of
> > > > > > (ledger id, entry id, batch index) as the key to represent the
> > position.
> > > > > > Currently, they are done by adding a conversion from
> > > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive
> > to
> > > > > > write something like:
> > > > > >
> > > > > > ```java
> > > > > > class MessageIdPosition implements Comparable<MessageIdPosition> {
> > > > > >     private final MessageId messageId;
> > > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > > ```
> > > > > >
> > > > > > Therefore, I'm going to write a proposal to redesign the MessageId
> > > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > > implementations, I think we can drop them because they are a part
> > > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > > >
> > > > > > Please feel free to share your points.
> > > > > >
> > > > > > Thanks,
> > > > > > Yunze
> >

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Joe F <jo...@gmail.com>.
>Maybe this design is to hidden some details, but if
users don't know the details like ledger id and entry id, how could
you know what does "0:0:-1:0" mean?

 Abstractions exist for a reason. Ledgerid and entryid are  implementation
details, and an application should not be interpreting that at all.
-j


On Tue, Nov 8, 2022 at 3:43 AM Yunze Xu <yz...@streamnative.io.invalid>
wrote:

> I didn't look into these two methods at the moment. But I think it's
> possible to
> retain only the `fromByteArray`.
>
> Thanks,
> Yunze
>
> On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com>
> wrote:
> >
> > Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> > <yz...@streamnative.io.invalid> ha scritto:
> > >
> > > Hi Enrico,
> > >
> > > > We also need a way to represent this as a String or a byte[]
> > >
> > > We already have the `toByteArray` method, right?
> >
> > Yes, correct. So we are fine. I forgot about it and I answered too
> quickly.
> >
> > I am not sure if this can be in the scope of this initiative, but we
> > should somehow get rid of
> > stuff like "fromByteArrayWithTopic" vs "fromByteArray".
> >
> > Thanks
> > Enrico
> >
> > >
> > > Thanks,
> > > Yunze
> > >
> > > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <eo...@gmail.com>
> wrote:
> > > >
> > > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > > <yz...@streamnative.io.invalid> ha scritto:
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Currently we have the following 5 implementations of MessageId:
> > > > >
> > > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker),
> where
> > > > >     acker is a wrapper of a BitSet.
> > > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
> > > > >     the first MessageIdImpl of a BitSet.
> > > > >   - MultiMessageIdImpl: adds a map that maps the topic name to the
> > > > >     MessageId.
> > > > > - TopicMessageIdImpl: adds the topic name and the partition name
> > > > >
> > > > > These implementations are such a mess. For example, when users get
> a
> > > > > MessageId from `Producer#send`:
> > > > >
> > > > > ```java
> > > > > var id = producer.send("msg");
> > > > > ```
> > > > >
> > > > > There is no getter to get some specific fields like ledger id. You
> can
> > > > > only see a representation from `toString` method and got some
> output
> > > > > like "0:0:-1:0". Maybe this design is to hidden some details, but
> if
> > > > > users don't know the details like ledger id and entry id, how could
> > > > > you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> > > > > implementation changed? Should it be treated as a breaking change?
> > > > >
> > > > > The original definition of the underlying MessageIdData is much
> more
> > > > > clear:
> > > > >
> > > > > ```proto
> > > > > message MessageIdData {
> > > > >     required uint64 ledgerId = 1;
> > > > >     required uint64 entryId  = 2;
> > > > >     optional int32 partition = 3 [default = -1];
> > > > >     optional int32 batch_index = 4 [default = -1];
> > > > >     repeated int64 ack_set = 5;
> > > > >     optional int32 batch_size = 6;
> > > > >
> > > > >     // For the chunk message id, we need to specify the first
> chunk message id.
> > > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > > }
> > > > > ```
> > > > >
> > > > > IMO, MessageId should be a wrapper of MessageIdData. It's more
> natural
> > > > > to have an interface like:
> > > > >
> > > > > ```java
> > > > > interface MessageId {
> > > > >     long ledgerId();
> > > > >     long entryId();
> > > > >     Optional<Integer> partition();
> > > > >     Optional<Integer> batchIndex();
> > > > >     // ...
> > > > > ```
> > > >
> > > > This is very good for client applications.
> > > > We also need a way to represent this as a String or a byte[], this
> way
> > > > client applications have a standard way to store
> > > > message offsets into an external system (for instance when you want
> to
> > > > user the Reader API and keep track of the position by yourself)
> > > >
> > > > Enrico
> > > >
> > > > >
> > > > > Additionally, there are many places that use only the triple of
> > > > > (ledger id, entry id, batch index) as the key to represent the
> position.
> > > > > Currently, they are done by adding a conversion from
> > > > > BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive
> to
> > > > > write something like:
> > > > >
> > > > > ```java
> > > > > class MessageIdPosition implements Comparable<MessageIdPosition> {
> > > > >     private final MessageId messageId;
> > > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > > ```
> > > > >
> > > > > Therefore, I'm going to write a proposal to redesign the MessageId
> > > > > interface only by adding some getters. Regarding the 5 existing
> > > > > implementations, I think we can drop them because they are a part
> > > > > of `pulsar-client`, not `pulsar-client-api`.
> > > > >
> > > > > Please feel free to share your points.
> > > > >
> > > > > Thanks,
> > > > > Yunze
>

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
I didn't look into these two methods at the moment. But I think it's possible to
retain only the `fromByteArray`.

Thanks,
Yunze

On Tue, Nov 8, 2022 at 7:02 PM Enrico Olivelli <eo...@gmail.com> wrote:
>
> Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
> <yz...@streamnative.io.invalid> ha scritto:
> >
> > Hi Enrico,
> >
> > > We also need a way to represent this as a String or a byte[]
> >
> > We already have the `toByteArray` method, right?
>
> Yes, correct. So we are fine. I forgot about it and I answered too quickly.
>
> I am not sure if this can be in the scope of this initiative, but we
> should somehow get rid of
> stuff like "fromByteArrayWithTopic" vs "fromByteArray".
>
> Thanks
> Enrico
>
> >
> > Thanks,
> > Yunze
> >
> > On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <eo...@gmail.com> wrote:
> > >
> > > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > > <yz...@streamnative.io.invalid> ha scritto:
> > > >
> > > > Hi all,
> > > >
> > > > Currently we have the following 5 implementations of MessageId:
> > > >
> > > > - MessageIdImpl: (ledger id, entry id, partition index)
> > > >   - BatchMessageIdImpl: adds (batch index, batch size, acker), where
> > > >     acker is a wrapper of a BitSet.
> > > >   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
> > > >     the first MessageIdImpl of a BitSet.
> > > >   - MultiMessageIdImpl: adds a map that maps the topic name to the
> > > >     MessageId.
> > > > - TopicMessageIdImpl: adds the topic name and the partition name
> > > >
> > > > These implementations are such a mess. For example, when users get a
> > > > MessageId from `Producer#send`:
> > > >
> > > > ```java
> > > > var id = producer.send("msg");
> > > > ```
> > > >
> > > > There is no getter to get some specific fields like ledger id. You can
> > > > only see a representation from `toString` method and got some output
> > > > like "0:0:-1:0". Maybe this design is to hidden some details, but if
> > > > users don't know the details like ledger id and entry id, how could
> > > > you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> > > > implementation changed? Should it be treated as a breaking change?
> > > >
> > > > The original definition of the underlying MessageIdData is much more
> > > > clear:
> > > >
> > > > ```proto
> > > > message MessageIdData {
> > > >     required uint64 ledgerId = 1;
> > > >     required uint64 entryId  = 2;
> > > >     optional int32 partition = 3 [default = -1];
> > > >     optional int32 batch_index = 4 [default = -1];
> > > >     repeated int64 ack_set = 5;
> > > >     optional int32 batch_size = 6;
> > > >
> > > >     // For the chunk message id, we need to specify the first chunk message id.
> > > >     optional MessageIdData first_chunk_message_id = 7;
> > > > }
> > > > ```
> > > >
> > > > IMO, MessageId should be a wrapper of MessageIdData. It's more natural
> > > > to have an interface like:
> > > >
> > > > ```java
> > > > interface MessageId {
> > > >     long ledgerId();
> > > >     long entryId();
> > > >     Optional<Integer> partition();
> > > >     Optional<Integer> batchIndex();
> > > >     // ...
> > > > ```
> > >
> > > This is very good for client applications.
> > > We also need a way to represent this as a String or a byte[], this way
> > > client applications have a standard way to store
> > > message offsets into an external system (for instance when you want to
> > > user the Reader API and keep track of the position by yourself)
> > >
> > > Enrico
> > >
> > > >
> > > > Additionally, there are many places that use only the triple of
> > > > (ledger id, entry id, batch index) as the key to represent the position.
> > > > Currently, they are done by adding a conversion from
> > > > BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
> > > > write something like:
> > > >
> > > > ```java
> > > > class MessageIdPosition implements Comparable<MessageIdPosition> {
> > > >     private final MessageId messageId;
> > > >     // TODO: compare only the triple (ledger, entry, batch)
> > > > ```
> > > >
> > > > Therefore, I'm going to write a proposal to redesign the MessageId
> > > > interface only by adding some getters. Regarding the 5 existing
> > > > implementations, I think we can drop them because they are a part
> > > > of `pulsar-client`, not `pulsar-client-api`.
> > > >
> > > > Please feel free to share your points.
> > > >
> > > > Thanks,
> > > > Yunze

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Enrico Olivelli <eo...@gmail.com>.
Il giorno mar 8 nov 2022 alle ore 11:52 Yunze Xu
<yz...@streamnative.io.invalid> ha scritto:
>
> Hi Enrico,
>
> > We also need a way to represent this as a String or a byte[]
>
> We already have the `toByteArray` method, right?

Yes, correct. So we are fine. I forgot about it and I answered too quickly.

I am not sure if this can be in the scope of this initiative, but we
should somehow get rid of
stuff like "fromByteArrayWithTopic" vs "fromByteArray".

Thanks
Enrico

>
> Thanks,
> Yunze
>
> On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <eo...@gmail.com> wrote:
> >
> > Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> > <yz...@streamnative.io.invalid> ha scritto:
> > >
> > > Hi all,
> > >
> > > Currently we have the following 5 implementations of MessageId:
> > >
> > > - MessageIdImpl: (ledger id, entry id, partition index)
> > >   - BatchMessageIdImpl: adds (batch index, batch size, acker), where
> > >     acker is a wrapper of a BitSet.
> > >   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
> > >     the first MessageIdImpl of a BitSet.
> > >   - MultiMessageIdImpl: adds a map that maps the topic name to the
> > >     MessageId.
> > > - TopicMessageIdImpl: adds the topic name and the partition name
> > >
> > > These implementations are such a mess. For example, when users get a
> > > MessageId from `Producer#send`:
> > >
> > > ```java
> > > var id = producer.send("msg");
> > > ```
> > >
> > > There is no getter to get some specific fields like ledger id. You can
> > > only see a representation from `toString` method and got some output
> > > like "0:0:-1:0". Maybe this design is to hidden some details, but if
> > > users don't know the details like ledger id and entry id, how could
> > > you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> > > implementation changed? Should it be treated as a breaking change?
> > >
> > > The original definition of the underlying MessageIdData is much more
> > > clear:
> > >
> > > ```proto
> > > message MessageIdData {
> > >     required uint64 ledgerId = 1;
> > >     required uint64 entryId  = 2;
> > >     optional int32 partition = 3 [default = -1];
> > >     optional int32 batch_index = 4 [default = -1];
> > >     repeated int64 ack_set = 5;
> > >     optional int32 batch_size = 6;
> > >
> > >     // For the chunk message id, we need to specify the first chunk message id.
> > >     optional MessageIdData first_chunk_message_id = 7;
> > > }
> > > ```
> > >
> > > IMO, MessageId should be a wrapper of MessageIdData. It's more natural
> > > to have an interface like:
> > >
> > > ```java
> > > interface MessageId {
> > >     long ledgerId();
> > >     long entryId();
> > >     Optional<Integer> partition();
> > >     Optional<Integer> batchIndex();
> > >     // ...
> > > ```
> >
> > This is very good for client applications.
> > We also need a way to represent this as a String or a byte[], this way
> > client applications have a standard way to store
> > message offsets into an external system (for instance when you want to
> > user the Reader API and keep track of the position by yourself)
> >
> > Enrico
> >
> > >
> > > Additionally, there are many places that use only the triple of
> > > (ledger id, entry id, batch index) as the key to represent the position.
> > > Currently, they are done by adding a conversion from
> > > BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
> > > write something like:
> > >
> > > ```java
> > > class MessageIdPosition implements Comparable<MessageIdPosition> {
> > >     private final MessageId messageId;
> > >     // TODO: compare only the triple (ledger, entry, batch)
> > > ```
> > >
> > > Therefore, I'm going to write a proposal to redesign the MessageId
> > > interface only by adding some getters. Regarding the 5 existing
> > > implementations, I think we can drop them because they are a part
> > > of `pulsar-client`, not `pulsar-client-api`.
> > >
> > > Please feel free to share your points.
> > >
> > > Thanks,
> > > Yunze

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
Hi Enrico,

> We also need a way to represent this as a String or a byte[]

We already have the `toByteArray` method, right?

Thanks,
Yunze

On Tue, Nov 8, 2022 at 6:43 PM Enrico Olivelli <eo...@gmail.com> wrote:
>
> Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
> <yz...@streamnative.io.invalid> ha scritto:
> >
> > Hi all,
> >
> > Currently we have the following 5 implementations of MessageId:
> >
> > - MessageIdImpl: (ledger id, entry id, partition index)
> >   - BatchMessageIdImpl: adds (batch index, batch size, acker), where
> >     acker is a wrapper of a BitSet.
> >   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
> >     the first MessageIdImpl of a BitSet.
> >   - MultiMessageIdImpl: adds a map that maps the topic name to the
> >     MessageId.
> > - TopicMessageIdImpl: adds the topic name and the partition name
> >
> > These implementations are such a mess. For example, when users get a
> > MessageId from `Producer#send`:
> >
> > ```java
> > var id = producer.send("msg");
> > ```
> >
> > There is no getter to get some specific fields like ledger id. You can
> > only see a representation from `toString` method and got some output
> > like "0:0:-1:0". Maybe this design is to hidden some details, but if
> > users don't know the details like ledger id and entry id, how could
> > you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> > implementation changed? Should it be treated as a breaking change?
> >
> > The original definition of the underlying MessageIdData is much more
> > clear:
> >
> > ```proto
> > message MessageIdData {
> >     required uint64 ledgerId = 1;
> >     required uint64 entryId  = 2;
> >     optional int32 partition = 3 [default = -1];
> >     optional int32 batch_index = 4 [default = -1];
> >     repeated int64 ack_set = 5;
> >     optional int32 batch_size = 6;
> >
> >     // For the chunk message id, we need to specify the first chunk message id.
> >     optional MessageIdData first_chunk_message_id = 7;
> > }
> > ```
> >
> > IMO, MessageId should be a wrapper of MessageIdData. It's more natural
> > to have an interface like:
> >
> > ```java
> > interface MessageId {
> >     long ledgerId();
> >     long entryId();
> >     Optional<Integer> partition();
> >     Optional<Integer> batchIndex();
> >     // ...
> > ```
>
> This is very good for client applications.
> We also need a way to represent this as a String or a byte[], this way
> client applications have a standard way to store
> message offsets into an external system (for instance when you want to
> user the Reader API and keep track of the position by yourself)
>
> Enrico
>
> >
> > Additionally, there are many places that use only the triple of
> > (ledger id, entry id, batch index) as the key to represent the position.
> > Currently, they are done by adding a conversion from
> > BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
> > write something like:
> >
> > ```java
> > class MessageIdPosition implements Comparable<MessageIdPosition> {
> >     private final MessageId messageId;
> >     // TODO: compare only the triple (ledger, entry, batch)
> > ```
> >
> > Therefore, I'm going to write a proposal to redesign the MessageId
> > interface only by adding some getters. Regarding the 5 existing
> > implementations, I think we can drop them because they are a part
> > of `pulsar-client`, not `pulsar-client-api`.
> >
> > Please feel free to share your points.
> >
> > Thanks,
> > Yunze

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Enrico Olivelli <eo...@gmail.com>.
Il giorno mar 8 nov 2022 alle ore 11:25 Yunze Xu
<yz...@streamnative.io.invalid> ha scritto:
>
> Hi all,
>
> Currently we have the following 5 implementations of MessageId:
>
> - MessageIdImpl: (ledger id, entry id, partition index)
>   - BatchMessageIdImpl: adds (batch index, batch size, acker), where
>     acker is a wrapper of a BitSet.
>   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
>     the first MessageIdImpl of a BitSet.
>   - MultiMessageIdImpl: adds a map that maps the topic name to the
>     MessageId.
> - TopicMessageIdImpl: adds the topic name and the partition name
>
> These implementations are such a mess. For example, when users get a
> MessageId from `Producer#send`:
>
> ```java
> var id = producer.send("msg");
> ```
>
> There is no getter to get some specific fields like ledger id. You can
> only see a representation from `toString` method and got some output
> like "0:0:-1:0". Maybe this design is to hidden some details, but if
> users don't know the details like ledger id and entry id, how could
> you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> implementation changed? Should it be treated as a breaking change?
>
> The original definition of the underlying MessageIdData is much more
> clear:
>
> ```proto
> message MessageIdData {
>     required uint64 ledgerId = 1;
>     required uint64 entryId  = 2;
>     optional int32 partition = 3 [default = -1];
>     optional int32 batch_index = 4 [default = -1];
>     repeated int64 ack_set = 5;
>     optional int32 batch_size = 6;
>
>     // For the chunk message id, we need to specify the first chunk message id.
>     optional MessageIdData first_chunk_message_id = 7;
> }
> ```
>
> IMO, MessageId should be a wrapper of MessageIdData. It's more natural
> to have an interface like:
>
> ```java
> interface MessageId {
>     long ledgerId();
>     long entryId();
>     Optional<Integer> partition();
>     Optional<Integer> batchIndex();
>     // ...
> ```

This is very good for client applications.
We also need a way to represent this as a String or a byte[], this way
client applications have a standard way to store
message offsets into an external system (for instance when you want to
user the Reader API and keep track of the position by yourself)

Enrico

>
> Additionally, there are many places that use only the triple of
> (ledger id, entry id, batch index) as the key to represent the position.
> Currently, they are done by adding a conversion from
> BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
> write something like:
>
> ```java
> class MessageIdPosition implements Comparable<MessageIdPosition> {
>     private final MessageId messageId;
>     // TODO: compare only the triple (ledger, entry, batch)
> ```
>
> Therefore, I'm going to write a proposal to redesign the MessageId
> interface only by adding some getters. Regarding the 5 existing
> implementations, I think we can drop them because they are a part
> of `pulsar-client`, not `pulsar-client-api`.
>
> Please feel free to share your points.
>
> Thanks,
> Yunze

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Yunze Xu <yz...@streamnative.io.INVALID>.
Hi Haiting,

> But please make sure we have to make it compatible with previous
implementations, like the `toString` method

Yeah, I agree, I will keep it compatible.

BTW, while I'm working on this, I found the MessageId implementations
are more complicated than I thought. The MessageIdImpl class must be a
POJO. Otherwise it cannot be passed into the admin API as the Entity. But I
still insists on returning an Optional<T> instead of T with default
value explained
in API docs.


Thanks,
Yunze

On Wed, Nov 9, 2022 at 3:02 PM Haiting Jiang <ji...@gmail.com> wrote:
>
> Overall, this makes sense to me.
> The current status of MessageId is a bit messy, especially for client
> developers and senior users who are interested in the implementation
> details.
> But please make sure we have to make it compatible with previous
> implementations, like the `toString` method, I bet someone has already
> done the parsing and got the ledgerId and entryId from it.
>
>
> Thanks,
> Haiting
>
> On Tue, Nov 8, 2022 at 6:25 PM Yunze Xu <yz...@streamnative.io.invalid> wrote:
> >
> > Hi all,
> >
> > Currently we have the following 5 implementations of MessageId:
> >
> > - MessageIdImpl: (ledger id, entry id, partition index)
> >   - BatchMessageIdImpl: adds (batch index, batch size, acker), where
> >     acker is a wrapper of a BitSet.
> >   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
> >     the first MessageIdImpl of a BitSet.
> >   - MultiMessageIdImpl: adds a map that maps the topic name to the
> >     MessageId.
> > - TopicMessageIdImpl: adds the topic name and the partition name
> >
> > These implementations are such a mess. For example, when users get a
> > MessageId from `Producer#send`:
> >
> > ```java
> > var id = producer.send("msg");
> > ```
> >
> > There is no getter to get some specific fields like ledger id. You can
> > only see a representation from `toString` method and got some output
> > like "0:0:-1:0". Maybe this design is to hidden some details, but if
> > users don't know the details like ledger id and entry id, how could
> > you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> > implementation changed? Should it be treated as a breaking change?
> >
> > The original definition of the underlying MessageIdData is much more
> > clear:
> >
> > ```proto
> > message MessageIdData {
> >     required uint64 ledgerId = 1;
> >     required uint64 entryId  = 2;
> >     optional int32 partition = 3 [default = -1];
> >     optional int32 batch_index = 4 [default = -1];
> >     repeated int64 ack_set = 5;
> >     optional int32 batch_size = 6;
> >
> >     // For the chunk message id, we need to specify the first chunk message id.
> >     optional MessageIdData first_chunk_message_id = 7;
> > }
> > ```
> >
> > IMO, MessageId should be a wrapper of MessageIdData. It's more natural
> > to have an interface like:
> >
> > ```java
> > interface MessageId {
> >     long ledgerId();
> >     long entryId();
> >     Optional<Integer> partition();
> >     Optional<Integer> batchIndex();
> >     // ...
> > ```
> >
> > Additionally, there are many places that use only the triple of
> > (ledger id, entry id, batch index) as the key to represent the position.
> > Currently, they are done by adding a conversion from
> > BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
> > write something like:
> >
> > ```java
> > class MessageIdPosition implements Comparable<MessageIdPosition> {
> >     private final MessageId messageId;
> >     // TODO: compare only the triple (ledger, entry, batch)
> > ```
> >
> > Therefore, I'm going to write a proposal to redesign the MessageId
> > interface only by adding some getters. Regarding the 5 existing
> > implementations, I think we can drop them because they are a part
> > of `pulsar-client`, not `pulsar-client-api`.
> >
> > Please feel free to share your points.
> >
> > Thanks,
> > Yunze

Re: [DISCUSSION] Redesign the MessageId interface

Posted by Haiting Jiang <ji...@gmail.com>.
Overall, this makes sense to me.
The current status of MessageId is a bit messy, especially for client
developers and senior users who are interested in the implementation
details.
But please make sure we have to make it compatible with previous
implementations, like the `toString` method, I bet someone has already
done the parsing and got the ledgerId and entryId from it.


Thanks,
Haiting

On Tue, Nov 8, 2022 at 6:25 PM Yunze Xu <yz...@streamnative.io.invalid> wrote:
>
> Hi all,
>
> Currently we have the following 5 implementations of MessageId:
>
> - MessageIdImpl: (ledger id, entry id, partition index)
>   - BatchMessageIdImpl: adds (batch index, batch size, acker), where
>     acker is a wrapper of a BitSet.
>   - ChunkMessageIdImpl: adds another MessageIdImpl that represents
>     the first MessageIdImpl of a BitSet.
>   - MultiMessageIdImpl: adds a map that maps the topic name to the
>     MessageId.
> - TopicMessageIdImpl: adds the topic name and the partition name
>
> These implementations are such a mess. For example, when users get a
> MessageId from `Producer#send`:
>
> ```java
> var id = producer.send("msg");
> ```
>
> There is no getter to get some specific fields like ledger id. You can
> only see a representation from `toString` method and got some output
> like "0:0:-1:0". Maybe this design is to hidden some details, but if
> users don't know the details like ledger id and entry id, how could
> you know what does "0:0:-1:0" mean? What if `MessageId#toString`'s
> implementation changed? Should it be treated as a breaking change?
>
> The original definition of the underlying MessageIdData is much more
> clear:
>
> ```proto
> message MessageIdData {
>     required uint64 ledgerId = 1;
>     required uint64 entryId  = 2;
>     optional int32 partition = 3 [default = -1];
>     optional int32 batch_index = 4 [default = -1];
>     repeated int64 ack_set = 5;
>     optional int32 batch_size = 6;
>
>     // For the chunk message id, we need to specify the first chunk message id.
>     optional MessageIdData first_chunk_message_id = 7;
> }
> ```
>
> IMO, MessageId should be a wrapper of MessageIdData. It's more natural
> to have an interface like:
>
> ```java
> interface MessageId {
>     long ledgerId();
>     long entryId();
>     Optional<Integer> partition();
>     Optional<Integer> batchIndex();
>     // ...
> ```
>
> Additionally, there are many places that use only the triple of
> (ledger id, entry id, batch index) as the key to represent the position.
> Currently, they are done by adding a conversion from
> BatchMessageIdImpl to MessageIdImpl. However, it's more intuitive to
> write something like:
>
> ```java
> class MessageIdPosition implements Comparable<MessageIdPosition> {
>     private final MessageId messageId;
>     // TODO: compare only the triple (ledger, entry, batch)
> ```
>
> Therefore, I'm going to write a proposal to redesign the MessageId
> interface only by adding some getters. Regarding the 5 existing
> implementations, I think we can drop them because they are a part
> of `pulsar-client`, not `pulsar-client-api`.
>
> Please feel free to share your points.
>
> Thanks,
> Yunze