Posted to dev@bookkeeper.apache.org by Jiannan Wang <ji...@yahoo-inc.com> on 2013/02/21 03:58:52 UTC

[Discussion] [Hedwig] Add queue semantic support for Hedwig

Hi guys,
   Under the current Hedwig semantics, a subscriber is not aware of messages published before it subscribes to the topic. So in the following example, subscriber A can only receive messages after seqId 2.
---------------------------------
Topic T: msg1 msg2 msg3 msg4 ...
                  | <- subscriber A subscribes to the topic
---------------------------------

   This semantic is very reasonable, but the Hedwig client needs to handle one corner case: a new topic is about to be created. Because a topic is lazily created by the first request (generally a PUB or SUB), the client side must coordinate between publisher and subscriber to make sure the first SUB is handled before the first PUB at this very early stage (consider a subscriber with a very bad network connection that causes the SUB to fail, while the user does not want to miss any messages). In summary, special work is required if a subscriber would like to receive all messages since the topic was created, and I think this requirement is very common.

   Handling this problem on the client side is one choice, but I think maybe we can simply resolve it on the server side if Hedwig supports queue semantics (so that we can also extend the Hedwig JMS provider to support JMS queues in BOOKKEEPER-312). As far as I know, the major concern with queue semantics is how long to keep the messages. However:
   1. It is the user's responsibility to know about the features and impact of queue semantics.
   2. On the other hand, we can add a parameter to limit the queue length.

   In short, here are the two problems I would like to discuss:
   1. How to gracefully resolve the above issue on the server side under the current semantics.
   2. Whether or not to introduce queue semantics into Hedwig.

Thanks,
Jiannan

Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Flavio Junqueira <fp...@yahoo.com>.
Hi Sijie,

I'd like to clarify one thing in your proposal. A message is only guaranteed to be consumed if we pick an appropriate value of messageBound. If messageBound is small, a message could be GC'ed before anyone consumes it. Is that right? The burden is on the application to decide which messageBound value is good enough, or perhaps to add some synchronization mechanism to avoid losing queued messages in exceptional situations. Exceptional situations are the ones in which consumers cannot make progress. Am I getting it right?
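
A minimal sketch of the concern above, assuming a simplified in-memory topic. The class BoundedTopic and its methods are hypothetical and are not Hedwig's API; the sketch only shows that, with a retention limit of messageBound, messages published while no consumer makes progress can be collected before they are ever read.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified model of a topic that retains at most
// messageBound messages; it is not Hedwig's actual implementation.
public class BoundedTopic {
    private final int messageBound;
    // Seq ids of the retained messages, oldest first.
    private final Deque<Long> retained = new ArrayDeque<>();
    private long lastSeqId = 0;

    public BoundedTopic(int messageBound) {
        this.messageBound = messageBound;
    }

    // Publish one message, then garbage-collect anything beyond the bound.
    public long publish() {
        retained.addLast(++lastSeqId);
        while (retained.size() > messageBound) {
            retained.removeFirst();
        }
        return lastSeqId;
    }

    // Oldest seq id still available, or -1 if nothing is retained.
    public long oldestSeqId() {
        return retained.isEmpty() ? -1 : retained.peekFirst();
    }

    public static void main(String[] args) {
        BoundedTopic topic = new BoundedTopic(3);
        for (int i = 0; i < 5; i++) {
            topic.publish();
        }
        // A consumer that made no progress so far can start at seq id 3
        // at the earliest: messages 1 and 2 were collected unconsumed.
        System.out.println("oldest retained seq id: " + topic.oldestSeqId());
    }
}
```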

-Flavio

On Feb 23, 2013, at 7:33 PM, Sijie Guo <gu...@gmail.com> wrote:

> On Sat, Feb 23, 2013 at 2:15 AM, Jiannan Wang <ji...@yahoo-inc.com> wrote:
> 
>> Hi Sijie,
>> Thanks for clearly explaining the difference between the pub/sub model and
>> the queue model. I did confuse them when there is only one subscriber on a
>> topic; I just wanted to invoke queue semantics to get around the problem :)
>> 
>> --------------------
>> two ideas could be pursued to resolve it (similar to what kafka did):
>> 1) have a subscription option to indicate whether to start subscribing from
>> the latest sequence id or the oldest sequence id.
>> 2) let the subscriber manage its consumed ptr and pass the consumed ptr
>> back when subscribing, to tell the hub server where to start delivery. this
>> subscriber could be a special subscriber distinguished by a subscription
>> option.
>> 
>> several benefits could be gained from 2):
>> a) eliminate the storage and access of subscription metadata.
>> b) provide a mechanism to rewind the subscription for replaying
>> already consumed messages.
>> --------------------
>> I see the ConsumerConfig class in kafka's api but cannot find a related
>> option.
>> 
> 
> sorry that I didn't describe it clearly. kafka lets the consumer maintain
> the consumer ptr, rather than the server side.
> You could check 1) the 'Simple Consumer' section here:
> http://kafka.apache.org/quickstart.html , 2) the 'Consume State' section here:
> http://kafka.apache.org/design.html
> 
> 
> 
>> For idea 1), we also need to change the current message garbage collection
>> behavior in Hedwig: for a topic with no subscribers, just keep messages up to
>> the messageBound limit. I am in favor of this solution.
>> idea 2) is cool, though it requires larger changes compared to 1).
>> 
> 
> Neither 1) nor 2) requires big changes.
> 
> for 1), we could simply have an option '*whence*' in SubscriptionOption,
> indicating where to start the subscription, with two values: OLDEST, LATEST.
> so when it is a first-time subscription, we pick the oldest or latest message
> as the consume ptr for this subscription.
> 
> for 2), we could have an optional field 'consumedseqid' in
> SubscriptionOption. if the subscriber provides this option, we use the
> provided 'consumedseqid' as the consume ptr. if the 'consumedseqid' is
> smaller than the oldest message, we should move the pointer to the oldest
> message, and if the 'consumedseqid' is larger than the latest message, we
> should move the pointer to the latest one. if the subscriber doesn't
> provide this option, we could fall back to the normal case and apply 1).
> 
> for completeness: the benefit I described before, eliminating storage for
> metadata, comes from having a special kind of subscriber (with a
> subscription option, 'inmemsubscription', indicating it is just an in-memory
> subscription; the hub server just keeps this subscription in memory during
> its lifetime). Leveraging the above two options, we could have the subscriber
> maintain the subscription state and pass it back when subscribing.
> 
> For both 1) and 2) we need to do the following things:
> 
> a) change the garbage collection policy to keep messages in line with
> the messageBound limit.
> b) read the oldest message seq id from the persistence manager. this is the
> core part we need to improve to achieve the 'subscribe from the oldest'
> semantic. one thing we need to take care of when reading the oldest message
> seq id: we cannot simply use the first seq id in LedgerRanges, since the
> first ledger might already be deleted but not yet removed from the ledger
> ranges metadata. (this happens because there is no transaction between
> ledger metadata and hedwig metadata).
> 
> so 1) and 2) are not two opposing solutions. they could be done together
> with the same changes.
> 
> 
> 
>> 
>> I see Flavio's reply to Yannick, which suggests using ZooKeeper to
>> coordinate the actions of publisher and subscriber. But that is a client-side
>> solution; I would prefer solution 1) in Sijie's proposal, which requires no
>> special work on the client side.
>> 
>> Thanks,
>> Jiannan
>> 
>> 
>> From: Sijie Guo <gu...@gmail.com>
>> Reply-To: "bookkeeper-user@zookeeper.apache.org" <
>> bookkeeper-user@zookeeper.apache.org>
>> Date: Thursday, February 21, 2013 4:50 PM
>> To: "bookkeeper-dev@zookeeper.apache.org" <
>> bookkeeper-dev@zookeeper.apache.org>
>> Cc: "bookkeeper-user@zookeeper.apache.org" <
>> bookkeeper-user@zookeeper.apache.org>, Hang Qi <ha...@yahoo-inc.com>,
>> Hongjian Chen <ho...@yahoo-inc.com>, Bizhu Qiu <qi...@yahoo-inc.com>,
>> Fangmin Lv <lv...@yahoo-inc.com>, Lin Shen <sh...@yahoo-inc.com>
>> 
>> Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig
>> 
>> Thanks Jiannan for raising the discussion of queue semantics. There were
>> some other people on the mailing list who asked for queue semantics before.
>> 
>> Basically, a topic (pub/sub) is quite different from a queue in messaging
>> concepts. In the pub/sub model, when a publisher publishes a message, it goes
>> to all the consumers (subscribers) who are interested, while a queue model
>> implements load-balancing semantics. A single message would be consumed
>> by almost exactly one consumer. It means that a queue has many consumers
>> with messages load balanced across the available consumers.
>> 
>> If the application requires all consumers to see the same view of published
>> messages, a topic is better for it. If the application doesn't care who
>> receives and consumes the published messages, a queue is better. But
>> these two concepts become similar when there is only one consumer. That
>> might make it confusing whether to use a queue or a topic.
>> 
>> for your case, it is still a pub/sub application. so your first question
>> is how to handle this case gracefully in a pub/sub model. two ideas could
>> be pursued to resolve it (similar to what kafka did):
>> 
>> 1) have a subscription option to indicate whether to start subscribing from
>> the latest sequence id or the oldest sequence id.
>> 
>> 2) let the subscriber manage its consumed ptr and pass the consumed ptr
>> back when subscribing, to tell the hub server where to start delivery. this
>> subscriber could be a special subscriber distinguished by a subscription
>> option.
>> 
>> several benefits could be gained from 2):
>> 
>> a) eliminate the storage and access of subscription metadata.
>> b) provide a mechanism to rewind the subscription for replaying
>> already consumed messages.
>> 
>> as for the garbage collection concern you mentioned, about how long to keep
>> the messages: we already have messageBound to limit the length of a topic. We
>> don't need to worry about it.
>> 
>> for your second question, it might be nice to have queue semantics in
>> Hedwig, since the JMS implementation needs it. But implementing queue
>> semantics is a totally different story from pub/sub.
>> 
>> -Sijie
>> 


Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
Thanks Sijie for supplying more detailed information about kafka.

so 1) and 2) are not two opposing solutions. they could be done together with the same changes.
-------------
Yes, you are right, the changes on the server side are quite similar. I said 2) requires large changes because I thought we should include the work of recording the consumed sequence id on the client side, but it seems that is the App's responsibility now.
I'll create a JIRA for it.
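
The server-side change discussed in this thread, options 1) and 2) together, could be sketched roughly as follows. This is only an illustration under assumptions: the class StartPointer, the enum Whence and the method resolve are hypothetical names, not existing Hedwig code. Only the option names 'whence' (OLDEST, LATEST) and 'consumedseqid' come from Sijie's proposal, and whether delivery starts at or just after the returned id is left open here.

```java
// Hypothetical sketch of how a hub server might pick the initial consume
// pointer for a subscription; none of these names exist in Hedwig.
public class StartPointer {
    public enum Whence { OLDEST, LATEST }

    // oldest, latest: seq ids of the oldest and latest retained messages.
    // consumedSeqId: optional pointer supplied by the subscriber, or null.
    public static long resolve(long oldest, long latest,
                               Whence whence, Long consumedSeqId) {
        if (consumedSeqId != null) {
            // Option 2): use the subscriber's pointer, clamped to the
            // range of messages that still exist.
            long requested = consumedSeqId;
            return Math.max(oldest, Math.min(latest, requested));
        }
        // Option 1): no pointer supplied, so fall back to 'whence'.
        return whence == Whence.OLDEST ? oldest : latest;
    }

    public static void main(String[] args) {
        // Retained messages span seq ids 10..50 in this example.
        System.out.println(resolve(10, 50, Whence.OLDEST, null)); // 10
        System.out.println(resolve(10, 50, Whence.LATEST, null)); // 50
        System.out.println(resolve(10, 50, Whence.LATEST, 3L));   // 10 (too old, moved up)
        System.out.println(resolve(10, 50, Whence.LATEST, 70L));  // 50 (too new, moved down)
        System.out.println(resolve(10, 50, Whence.LATEST, 30L));  // 30
    }
}
```

The value passed as oldest would have to come from the persistence manager, taking care of the case Sijie mentions where the first ledger is already deleted but still listed in the ledger ranges metadata.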

Many thanks to everyone who joined this discussion.

Regards,
Jiannan



Both 1) and 2) we need to do following things:

a) change the garbage collection policy to keep messages aligned with messageBound limitation.
b) read the oldest message seq id from persistence manager. this is the core part we need to improve to achieve 'subscribe the oldest' semantic. one place we need to take care when reading the oldest message seq id: we could not simply use the first seq id in LedgerRanges, since the first ledger might already deleted but not removed from ledger ranges metadata. (it is caused because there is no transaction between ledger metadata and hedwig metadata).

so 1) and 2) are not two opposite solution. they could be done together with same changes.



I see Flavio's reply to Yannick which suggests using ZooKeeper to coordinate the actions of publisher and subscriber. But it's a client-side solution, I would prefer solution 1) in Sijie's proposal which requires no special works in client side.

Thanks,
Jiannan


From: Sijie Guo <gu...@gmail.com>
Reply-To: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>
Date: Thursday, February 21, 2013 4:50 PM
To: "bookkeeper-dev@zookeeper.apache.org" <bo...@zookeeper.apache.org>
Cc: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>, Hang Qi <ha...@yahoo-inc.com>, Hongjian Chen <ho...@yahoo-inc.com>, Bizhu Qiu <qi...@yahoo-inc.com>, Fangmin Lv <lv...@yahoo-inc.com>, Lin Shen <sh...@yahoo-inc.com>

Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Thanks Jiannan for raising the discussion of queue semantic. There was some other guys in the mail list asked for queue semantic before.

Basically, topic (pub/sub) is quite different from queue in messaging concepts. In pub/sub model, when a publisher publish a message, it goes to all the consumers (subscribers) who are interested; while a queue model implements a load balancer semantic. A single message would be consumed almost exactly by one consumer. It means that a queue has many consumers with messages load balanced across the available consumers.

If the application requires all consumers seen same view of published messages, a topic is better for it. If the application doesn't matter who would receive and consume the published messages, a queue is better. But these two concepts become similar when there are only one consumer. It might make you confused on using a queue or a topic.

for your case, it is still a pub/sub application. so your first question is how to handle this case gracefully in a pub/sub model. two ideas could be proceed to resolve it (similar as what kafka did):

1) have a subscription option to indicate subscribe starting from the latest sequence id or the oldest sequence id.

2) let subscriber managed its consumed ptr and passed the consumed ptr back when subscribe to tell hub server where to start delivery. this subscriber could be a special subscriber distinguished by a subscription option.

several benefits could be made by 2):

a) eliminate the storage and access of subscription metadata.
b) provided the mechanism to rewind the subscription back for replaying already consumed messages again.

for the garbage collection stuff you mentioned on how long to keep the messages, we already have messageBound to limit the length of a topic. We don't need to worry about it.

for your second question, it might be nice to have the queue semantic in Hedwig, since JMS implementation needs it. But implementing the queue semantic is totally a different story than pub/sub.

-Sijie


On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <ji...@yahoo-inc.com> wrote:
Hi guys,
   Under current Hedwig semantic, a subscriber cannot aware of messages published before he subscribes the topic. So in following example, subscriber A can only receives messages after seqId 2.
---------------------------------
Topic T: msg1 msg2 msg3 msg4 ...
                     | <- subscriber A subscribe the topic
---------------------------------

   This semantic is very reasonable, but Hedwig client needs to handle this corner case: a new topic is just to be created, and as topic is lazily created by the first request (generally it's PUB or SUB), so the client side must coordinate between publisher and subscriber to make sure the first SUB is handled before the first PUB at this very beginning status (consider subscriber may have very bad network connection which causes SUB failed and user does not want to miss any messages). In summary, it requires special works if there is a subscriber would like to receive all the messages since topic is created, and I think this requirement is very general.

   Handle this problem in client side is a choice, but I think maybe we can simply resolve it  in server side if Hedwig can support queue semantic (so that we can also extend Hedwig JMS provider to support JMS queue in BOOKKEEPER-312). And as I known, the major concern on queue semantic is how long to keep the messages, however:
   1. It is user's responsibility to know about the feature and impact of queue semantic.
   2. On the other hand, we can add a parameter to limit the queue length.

   In a word, here are the two problem I would like to discuss:
   1. How to gracefully resolve the above issue in server side under current semantic.
   2. Whether or not to introduce queue semantic into Hedwig.

Thanks,
Jiannan



Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Sijie Guo <gu...@gmail.com>.
On Sat, Feb 23, 2013 at 2:15 AM, Jiannan Wang <ji...@yahoo-inc.com> wrote:

> Hi Sijie,
>    Thanks for well explaining on the difference between pub/sub model and
> queue model, I did confuse on them when there is only one subscriber on
> topic, I just want to invoke queue semantic to get around the problem :)
>
> --------------------
> two ideas could be proceed to resolve it (similar as what kafka did):
> 1) have a subscription option to indicate subscribe starting from the
> latest sequence id or the oldest sequence id.
> 2) let subscriber managed its consumed ptr and passed the consumed ptr
> back when subscribe to tell hub server where to start delivery. this
> subscriber could be a special subscriber distinguished by a subscription
> option.
>
> several benefits could be made by 2):
> a) eliminate the storage and access of subscription metadata.
> b) provided the mechanism to rewind the subscription back for replaying
> already consumed messages again.
> --------------------
> I see the ConsumerConfig class in kafka's api but cannot find related
> option.
>

Sorry that I didn't describe it clearly. Kafka lets the consumer maintain
the consume ptr, rather than keeping it on the server side.
You could check 1) the 'Simple Consumer' section here:
http://kafka.apache.org/quickstart.html , 2) the 'Consume State' section here:
http://kafka.apache.org/design.html



> For idea 1), we also need to change current message garbage collection
> behavior in Hedwig: for topic with no subscriber just keep the message with
> messageBound limit. I in favor of this solution.
> idea 2) is cool though it requires large changes compare to 1).
>

Neither 1) nor 2) requires big changes.

For 1), we could simply have an option 'whence' in SubscriptionOption,
indicating where to start the subscription, with two values: OLDEST and
LATEST. For a first-time subscription, we pick the oldest or the latest
message as the consume ptr for this subscription.

For 2), we could have an optional field 'consumedseqid' in
SubscriptionOption. If the subscriber provides it, we use the provided
'consumedseqid' as the consume ptr. If the 'consumedseqid' is smaller than
the oldest message, we move the pointer to the oldest message, and if it
is larger than the latest message, we move the pointer to the latest one.
If the subscriber doesn't provide the option, we fall back to the normal
case and apply 1).
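
The two options above can be sketched roughly as follows. This is only an illustration of the clamping and fallback rules as described in this mail; the class, enum, and method names are hypothetical and are not taken from the Hedwig code base. It also ignores whether the consume ptr means "last consumed" or "next to deliver", which a real change would have to pin down.

```java
/** Illustrative only: none of these names exist in Hedwig. */
final class StartPointerSketch {

    /** Hypothetical values for the proposed 'whence' option. */
    enum Whence { OLDEST, LATEST }

    /**
     * Picks the consume ptr for a first-time subscription.
     *
     * @param consumedSeqId pointer supplied by the subscriber, or null if absent
     * @param whence        fallback used when no pointer is supplied
     * @param oldestSeqId   oldest sequence id still retained for the topic
     * @param latestSeqId   latest sequence id published to the topic
     */
    static long resolve(Long consumedSeqId, Whence whence,
                        long oldestSeqId, long latestSeqId) {
        if (consumedSeqId == null) {
            // Option 1): no pointer supplied, so 'whence' decides.
            return whence == Whence.OLDEST ? oldestSeqId : latestSeqId;
        }
        // Option 2): clamp the supplied pointer into [oldest, latest].
        long requested = consumedSeqId;
        return Math.max(oldestSeqId, Math.min(requested, latestSeqId));
    }

    public static void main(String[] args) {
        System.out.println(resolve(null, Whence.OLDEST, 10L, 50L)); // prints 10
        System.out.println(resolve(99L, Whence.LATEST, 10L, 50L));  // prints 50
    }
}
```

Keeping both rules in one function reflects the point that 2) falls back to 1) when the subscriber supplies no pointer.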

For completeness: the benefit I described before, eliminating the storage
of subscription metadata, comes from having a special kind of subscriber
(one with a subscription option, 'inmemsubscription', indicating that it
is just an in-memory subscription, which the hub server keeps only in
memory during its lifetime). Leveraging the above two options, we could
have the subscriber maintain the subscription state and pass it back when
it subscribes.

For both 1) and 2), we need to do the following things:

a) Change the garbage collection policy to keep messages in line with the
messageBound limit.
b) Read the oldest message seq id from the persistence manager. This is
the core part we need to improve to achieve the 'subscribe from the
oldest' semantic. One thing to take care of when reading the oldest
message seq id: we cannot simply use the first seq id in LedgerRanges,
since the first ledger might already have been deleted but not yet
removed from the ledger ranges metadata. (This happens because there is
no transaction between the ledger metadata and the Hedwig metadata.)
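
The caveat in b) can be sketched as follows. This is a minimal illustration, assuming a simplified range entry that records a ledger id and the first sequence id stored in that ledger; the `Range` class and the `ledgerExists` check are hypothetical and do not mirror Hedwig's actual metadata classes or the BookKeeper client API.

```java
import java.util.List;
import java.util.OptionalLong;
import java.util.function.LongPredicate;

/** Illustrative only: a simplified model of the ledger ranges metadata. */
final class OldestSeqIdSketch {

    /** Hypothetical stand-in for one entry of the ledger ranges metadata. */
    static final class Range {
        final long ledgerId;
        final long startSeqId;

        Range(long ledgerId, long startSeqId) {
            this.ledgerId = ledgerId;
            this.startSeqId = startSeqId;
        }
    }

    /**
     * Returns the first sequence id of the earliest range whose ledger
     * still exists, skipping entries whose ledger was already deleted
     * but is still listed in the metadata.
     */
    static OptionalLong oldestSeqId(List<Range> ranges, LongPredicate ledgerExists) {
        for (Range r : ranges) {
            if (ledgerExists.test(r.ledgerId)) {
                return OptionalLong.of(r.startSeqId);
            }
        }
        // No readable ledger is left for this topic.
        return OptionalLong.empty();
    }
}
```

For example, with ranges for ledgers 7 and 8 where ledger 7 has already been deleted, the sketch returns the first sequence id of ledger 8 rather than that of the stale entry.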

So 1) and 2) are not two opposing solutions. They could be done together
with the same changes.



>
> I see Flavio's reply to Yannick which suggests using ZooKeeper to
> coordinate the actions of publisher and subscriber. But it's a client-side
> solution, I would prefer solution 1) in Sijie's proposal which requires no
> special works in client side.
>
> Thanks,
> Jiannan
>
>
> From: Sijie Guo <gu...@gmail.com>
> Reply-To: "bookkeeper-user@zookeeper.apache.org" <
> bookkeeper-user@zookeeper.apache.org>
> Date: Thursday, February 21, 2013 4:50 PM
> To: "bookkeeper-dev@zookeeper.apache.org" <
> bookkeeper-dev@zookeeper.apache.org>
> Cc: "bookkeeper-user@zookeeper.apache.org" <
> bookkeeper-user@zookeeper.apache.org>, Hang Qi <ha...@yahoo-inc.com>,
> Hongjian Chen <ho...@yahoo-inc.com>, Bizhu Qiu <qi...@yahoo-inc.com>,
> Fangmin Lv <lv...@yahoo-inc.com>, Lin Shen <sh...@yahoo-inc.com>
>
> Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig
>
> Thanks Jiannan for raising the discussion of queue semantic. There was
> some other guys in the mail list asked for queue semantic before.
>
> Basically, topic (pub/sub) is quite different from queue in messaging
> concepts. In pub/sub model, when a publisher publish a message, it goes to
> all the consumers (subscribers) who are interested; while a queue model
> implements a load balancer semantic. A single message would be consumed
> almost exactly by one consumer. It means that a queue has many consumers
> with messages load balanced across the available consumers.
>
> If the application requires all consumers seen same view of published
> messages, a topic is better for it. If the application doesn't matter who
> would receive and consume the published messages, a queue is better. But
> these two concepts become similar when there are only one consumer. It
> might make you confused on using a queue or a topic.
>
> for your case, it is still a pub/sub application. so your first question
> is how to handle this case gracefully in a pub/sub model. two ideas could
> be proceed to resolve it (similar as what kafka did):
>
> 1) have a subscription option to indicate subscribe starting from the
> latest sequence id or the oldest sequence id.
>
> 2) let subscriber managed its consumed ptr and passed the consumed ptr
> back when subscribe to tell hub server where to start delivery. this
> subscriber could be a special subscriber distinguished by a subscription
> option.
>
> several benefits could be made by 2):
>
> a) eliminate the storage and access of subscription metadata.
> b) provided the mechanism to rewind the subscription back for replaying
> already consumed messages again.
>
> for the garbage collection stuff you mentioned on how long to keep the
> messages, we already have messageBound to limit the length of a topic. We
> don't need to worry about it.
>
> for your second question, it might be nice to have the queue semantic in
> Hedwig, since JMS implementation needs it. But implementing the queue
> semantic is totally a different story than pub/sub.
>
> -Sijie
>
>
> On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <ji...@yahoo-inc.com>wrote:
>
>> Hi guys,
>>    Under current Hedwig semantic, a subscriber cannot aware of messages
>> published before he subscribes the topic. So in following example,
>> subscriber A can only receives messages after seqId 2.
>> ---------------------------------
>> Topic T: msg1 msg2 msg3 msg4 ...
>>                      | <- subscriber A subscribe the topic
>> ---------------------------------
>>
>>    This semantic is very reasonable, but Hedwig client needs to handle
>> this corner case: a new topic is just to be created, and as topic is lazily
>> created by the first request (generally it's PUB or SUB), so the client
>> side must coordinate between publisher and subscriber to make sure the
>> first SUB is handled before the first PUB at this very beginning status
>> (consider subscriber may have very bad network connection which causes SUB
>> failed and user does not want to miss any messages). In summary, it
>> requires special works if there is a subscriber would like to receive all
>> the messages since topic is created, and I think this requirement is very
>> general.
>>
>>    Handle this problem in client side is a choice, but I think maybe we
>> can simply resolve it  in server side if Hedwig can support queue semantic
>> (so that we can also extend Hedwig JMS provider to support JMS queue in
>> BOOKKEEPER-312). And as I known, the major concern on queue semantic is how
>> long to keep the messages, however:
>>    1. It is user's responsibility to know about the feature and impact of
>> queue semantic.
>>    2. On the other hand, we can add a parameter to limit the queue length.
>>
>>    In a word, here are the two problem I would like to discuss:
>>    1. How to gracefully resolve the above issue in server side under
>> current semantic.
>>    2. Whether or not to introduce queue semantic into Hedwig.
>>
>> Thanks,
>> Jiannan
>>
>
>

Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
Hi Flavio,

"With this solution, depending on how fast the subscriber reacts, the
message could be garbage collected by the time the subscriber comes in and
says "oldest sequence id", no? In my understanding, the property you're
looking for can only be guaranteed in a best-effort fashion, the system
can't guarantee it will be there."
-------
You are right, which is why I suggested slightly changing the current
message garbage collection strategy in my previous mail: "For idea 1), we
also need to change the current message garbage collection behavior in
Hedwig: for a topic with no subscribers, keep messages up to the
messageBound limit instead of deleting all messages."
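
As a rough illustration of the arithmetic behind this suggestion, here is a minimal sketch for a topic with no subscribers. It assumes sequence ids start at 1 and that messageBound is a simple count of messages to keep; the class and method names are hypothetical and not part of Hedwig.

```java
/** Illustrative only: retention boundary for a topic with no subscribers. */
final class RetentionSketch {

    /**
     * Returns the smallest sequence id that would be kept, so that at most
     * messageBound of the most recent messages survive garbage collection.
     * Assumes sequence ids start at 1.
     */
    static long firstRetainedSeqId(long latestSeqId, long messageBound) {
        if (messageBound <= 0) {
            throw new IllegalArgumentException("messageBound must be positive");
        }
        return Math.max(1L, latestSeqId - messageBound + 1);
    }
}
```

With a latest sequence id of 100 and a messageBound of 10, messages 91 to 100 would be kept; a subscriber that arrives later and asks for the oldest message would then start from 91, which is still best effort, as Flavio points out.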


Thanks,
Jiannan


Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Flavio Junqueira <fp...@yahoo.com>.
On Feb 23, 2013, at 2:01 PM, Yannick Legros <ya...@gmail.com> wrote:

> solution 1) of Sijie "1) have a subscription option to indicate subscribe starting from the latest sequence id or the oldest sequence id."
> also seem's to be a good solution for my case.


With this solution, depending on how fast the subscriber reacts, the message could be garbage collected by the time the subscriber comes in and says "oldest sequence id", no? In my understanding, the property you're looking for can only be guaranteed in a best-effort fashion, the system can't guarantee it will be there.

-Flavio

Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Yannick Legros <ya...@gmail.com>.
Hi all, and thanks for taking my special case into consideration ;-).

Flavio says:
"Yannick's message seems to imply that a subscriber can't subscribe before
an external event causing a race between a publisher and the subscriber. If
you solve it, for example, using bookkeeper directly as one of the options
Sijie suggests, you'd end up coordinating using zookeeper or using some
other coordination scheme. Yannick's case seems to be a special one."

--> You are right, there is a race between subscriber and publisher, and if
the publisher wins, I will lose at least one message. So yes, I can solve it
with a ZooKeeper lock that forces the publisher to wait for the subscriber's
call.

Solution 1) of Sijie, "1) have a subscription option to indicate subscribe
starting from the latest sequence id or the oldest sequence id.",
also seems to be a good solution for my case.

Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Flavio Junqueira <fp...@yahoo.com>.
On Feb 23, 2013, at 11:15 AM, Jiannan Wang <ji...@yahoo-inc.com> wrote:

> 
> I see Flavio's reply to Yannick which suggests using ZooKeeper to coordinate the actions of publisher and subscriber. But it's a client-side solution, I would prefer solution 1) in Sijie's proposal which requires no special works in client side.
> 

Yannick's message seems to imply that a subscriber can't subscribe before an external event causing a race between a publisher and the subscriber. If you solve it, for example, using bookkeeper directly as one of the options Sijie suggests, you'd end up coordinating using zookeeper or using some other coordination scheme. Yannick's case seems to be a special one.

-Flavio

> 
> 
> From: Sijie Guo <gu...@gmail.com>
> Reply-To: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>
> Date: Thursday, February 21, 2013 4:50 PM
> To: "bookkeeper-dev@zookeeper.apache.org" <bo...@zookeeper.apache.org>
> Cc: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>, Hang Qi <ha...@yahoo-inc.com>, Hongjian Chen <ho...@yahoo-inc.com>, Bizhu Qiu <qi...@yahoo-inc.com>, Fangmin Lv <lv...@yahoo-inc.com>, Lin Shen <sh...@yahoo-inc.com>
> Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig
> 
> Thanks Jiannan for raising the discussion of queue semantic. There was some other guys in the mail list asked for queue semantic before.
> 
> Basically, topic (pub/sub) is quite different from queue in messaging concepts. In pub/sub model, when a publisher publish a message, it goes to all the consumers (subscribers) who are interested; while a queue model implements a load balancer semantic. A single message would be consumed almost exactly by one consumer. It means that a queue has many consumers with messages load balanced across the available consumers.
> 
> If the application requires all consumers seen same view of published messages, a topic is better for it. If the application doesn't matter who would receive and consume the published messages, a queue is better. But these two concepts become similar when there are only one consumer. It might make you confused on using a queue or a topic.
> 
> for your case, it is still a pub/sub application. so your first question is how to handle this case gracefully in a pub/sub model. two ideas could be proceed to resolve it (similar as what kafka did):
> 
> 1) have a subscription option to indicate subscribe starting from the latest sequence id or the oldest sequence id.
> 
> 2) let subscriber managed its consumed ptr and passed the consumed ptr back when subscribe to tell hub server where to start delivery. this subscriber could be a special subscriber distinguished by a subscription option.
> 
> several benefits could be made by 2):
> 
> a) eliminate the storage and access of subscription metadata.
> b) provided the mechanism to rewind the subscription back for replaying already consumed messages again.
> 
> for the garbage collection stuff you mentioned on how long to keep the messages, we already have messageBound to limit the length of a topic. We don't need to worry about it.
> 
> for your second question, it might be nice to have the queue semantic in Hedwig, since JMS implementation needs it. But implementing the queue semantic is totally a different story than pub/sub.
> 
> -Sijie
> 
> 
> On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <ji...@yahoo-inc.com> wrote:
>> Hi guys,
>>    Under current Hedwig semantic, a subscriber cannot aware of messages published before he subscribes the topic. So in following example, subscriber A can only receives messages after seqId 2.
>> ---------------------------------
>> Topic T: msg1 msg2 msg3 msg4 ...
>>                      | <- subscriber A subscribe the topic
>> ---------------------------------
>> 
>>    This semantic is very reasonable, but Hedwig client needs to handle this corner case: a new topic is just to be created, and as topic is lazily created by the first request (generally it's PUB or SUB), so the client side must coordinate between publisher and subscriber to make sure the first SUB is handled before the first PUB at this very beginning status (consider subscriber may have very bad network connection which causes SUB failed and user does not want to miss any messages). In summary, it requires special works if there is a subscriber would like to receive all the messages since topic is created, and I think this requirement is very general.
>> 
>>    Handle this problem in client side is a choice, but I think maybe we can simply resolve it  in server side if Hedwig can support queue semantic (so that we can also extend Hedwig JMS provider to support JMS queue in BOOKKEEPER-312). And as I known, the major concern on queue semantic is how long to keep the messages, however:
>>    1. It is user's responsibility to know about the feature and impact of queue semantic.
>>    2. On the other hand, we can add a parameter to limit the queue length.
>> 
>>    In a word, here are the two problem I would like to discuss:
>>    1. How to gracefully resolve the above issue in server side under current semantic.
>>    2. Whether or not to introduce queue semantic into Hedwig.
>> 
>> Thanks,
>> Jiannan
> 


Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Flavio Junqueira <fp...@yahoo.com>.
On Feb 23, 2013, at 11:15 AM, Jiannan Wang <ji...@yahoo-inc.com> wrote:

> 
> I see Flavio's reply to Yannick which suggests using ZooKeeper to coordinate the actions of publisher and subscriber. But it's a client-side solution, I would prefer solution 1) in Sijie's proposal which requires no special works in client side.
> 

Yannick's message seems to imply that a subscriber can't subscribe before an external event, which causes a race between a publisher and the subscriber. If you solve it, for example, by using BookKeeper directly as one of the options Sijie suggests, you'd end up coordinating through ZooKeeper or some other coordination scheme. Yannick's case seems to be a special one.
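
To make the coordination point concrete, here is a minimal sketch of the client-side approach, assuming a plain in-process event as a stand-in for whatever external signal (for example a ZooKeeper znode) would be used in practice. The names topic, subscriber_ready and start_index are hypothetical and not part of Hedwig:

```python
import threading

topic = []                            # toy stand-in for a topic's message log
subscriber_ready = threading.Event()  # assumed stand-in for an external signal
start_index = {}                      # where each subscriber's delivery begins


def subscriber():
    # Record the current end of the topic as the delivery start, then signal.
    start_index["A"] = len(topic)
    subscriber_ready.set()


def publisher():
    # Hold back the first PUB until the first SUB has been handled.
    subscriber_ready.wait()
    for i in range(1, 4):
        topic.append(f"msg{i}")


pub = threading.Thread(target=publisher)
sub = threading.Thread(target=subscriber)
pub.start()  # the publisher starts first but blocks on the event
sub.start()
pub.join()
sub.join()

received = topic[start_index["A"]:]
print(received)
```

The publisher cannot append anything until the subscriber has registered, so the subscriber sees every message, at the cost of extra coordination logic in the clients.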

-Flavio

> 
> 
> From: Sijie Guo <gu...@gmail.com>
> Reply-To: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>
> Date: Thursday, February 21, 2013 4:50 PM
> To: "bookkeeper-dev@zookeeper.apache.org" <bo...@zookeeper.apache.org>
> Cc: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>, Hang Qi <ha...@yahoo-inc.com>, Hongjian Chen <ho...@yahoo-inc.com>, Bizhu Qiu <qi...@yahoo-inc.com>, Fangmin Lv <lv...@yahoo-inc.com>, Lin Shen <sh...@yahoo-inc.com>
> Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig
> 
> Thanks Jiannan for raising the discussion of queue semantic. There was some other guys in the mail list asked for queue semantic before.
> 
> Basically, topic (pub/sub) is quite different from queue in messaging concepts. In pub/sub model, when a publisher publish a message, it goes to all the consumers (subscribers) who are interested; while a queue model implements a load balancer semantic. A single message would be consumed almost exactly by one consumer. It means that a queue has many consumers with messages load balanced across the available consumers.
> 
> If the application requires all consumers seen same view of published messages, a topic is better for it. If the application doesn't matter who would receive and consume the published messages, a queue is better. But these two concepts become similar when there are only one consumer. It might make you confused on using a queue or a topic.
> 
> for your case, it is still a pub/sub application. so your first question is how to handle this case gracefully in a pub/sub model. two ideas could be proceed to resolve it (similar as what kafka did):
> 
> 1) have a subscription option to indicate subscribe starting from the latest sequence id or the oldest sequence id.
> 
> 2) let subscriber managed its consumed ptr and passed the consumed ptr back when subscribe to tell hub server where to start delivery. this subscriber could be a special subscriber distinguished by a subscription option.
> 
> several benefits could be made by 2):
> 
> a) eliminate the storage and access of subscription metadata.
> b) provided the mechanism to rewind the subscription back for replaying already consumed messages again.
> 
> for the garbage collection stuff you mentioned on how long to keep the messages, we already have messageBound to limit the length of a topic. We don't need to worry about it.
> 
> for your second question, it might be nice to have the queue semantic in Hedwig, since JMS implementation needs it. But implementing the queue semantic is totally a different story than pub/sub.
> 
> -Sijie
> 
> 
> On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <ji...@yahoo-inc.com> wrote:
>> Hi guys,
>>    Under current Hedwig semantic, a subscriber cannot aware of messages published before he subscribes the topic. So in following example, subscriber A can only receives messages after seqId 2.
>> ---------------------------------
>> Topic T: msg1 msg2 msg3 msg4 ...
>>                      | <- subscriber A subscribe the topic
>> ---------------------------------
>> 
>>    This semantic is very reasonable, but Hedwig client needs to handle this corner case: a new topic is just to be created, and as topic is lazily created by the first request (generally it's PUB or SUB), so the client side must coordinate between publisher and subscriber to make sure the first SUB is handled before the first PUB at this very beginning status (consider subscriber may have very bad network connection which causes SUB failed and user does not want to miss any messages). In summary, it requires special works if there is a subscriber would like to receive all the messages since topic is created, and I think this requirement is very general.
>> 
>>    Handle this problem in client side is a choice, but I think maybe we can simply resolve it  in server side if Hedwig can support queue semantic (so that we can also extend Hedwig JMS provider to support JMS queue in BOOKKEEPER-312). And as I known, the major concern on queue semantic is how long to keep the messages, however:
>>    1. It is user's responsibility to know about the feature and impact of queue semantic.
>>    2. On the other hand, we can add a parameter to limit the queue length.
>> 
>>    In a word, here are the two problem I would like to discuss:
>>    1. How to gracefully resolve the above issue in server side under current semantic.
>>    2. Whether or not to introduce queue semantic into Hedwig.
>> 
>> Thanks,
>> Jiannan
> 


Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
Hi Sijie,
   Thanks for the clear explanation of the difference between the pub/sub model and the queue model. I did confuse them for the case where there is only one subscriber on a topic; I just wanted to use queue semantics to get around the problem :)

--------------------
two ideas could be proceed to resolve it (similar as what kafka did):
1) have a subscription option to indicate subscribe starting from the latest sequence id or the oldest sequence id.
2) let subscriber managed its consumed ptr and passed the consumed ptr back when subscribe to tell hub server where to start delivery. this subscriber could be a special subscriber distinguished by a subscription option.

several benefits could be made by 2):
a) eliminate the storage and access of subscription metadata.
b) provided the mechanism to rewind the subscription back for replaying already consumed messages again.
--------------------
I looked at the ConsumerConfig class in Kafka's API but could not find a related option.
For idea 1), we would also need to change the current message garbage collection behavior in Hedwig: for a topic with no subscribers, just keep the messages up to the messageBound limit. I am in favor of this solution.
Idea 2) is cool, though it requires larger changes compared to 1).
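
As a rough illustration of idea 1) combined with bounded retention, here is a minimal in-memory sketch. It is a toy model under stated assumptions, not Hedwig's implementation: BoundedTopic, StartFrom and message_bound are hypothetical names, and the bound is applied with a fixed-size deque regardless of whether anyone is subscribed.

```python
from collections import deque
from enum import Enum


class StartFrom(Enum):
    LATEST = "latest"   # only messages published after the subscribe
    OLDEST = "oldest"   # everything still retained by the topic


class BoundedTopic:
    def __init__(self, message_bound):
        # Keep at most message_bound messages; older ones are dropped.
        self._retained = deque(maxlen=message_bound)  # (seq_id, payload)
        self._next_seq_id = 1
        self._cursors = {}  # subscriber id -> last delivered seq id

    def publish(self, payload):
        self._retained.append((self._next_seq_id, payload))
        self._next_seq_id += 1

    def subscribe(self, subscriber_id, start_from=StartFrom.LATEST):
        if start_from is StartFrom.OLDEST and self._retained:
            cursor = self._retained[0][0] - 1
        else:
            cursor = self._next_seq_id - 1
        self._cursors[subscriber_id] = cursor

    def deliver(self, subscriber_id):
        cursor = self._cursors[subscriber_id]
        pending = [(s, p) for s, p in self._retained if s > cursor]
        if pending:
            self._cursors[subscriber_id] = pending[-1][0]
        return [p for _, p in pending]


topic = BoundedTopic(message_bound=3)
for i in range(1, 5):
    topic.publish(f"msg{i}")      # published before anyone subscribes

topic.subscribe("A", StartFrom.OLDEST)
topic.subscribe("B", StartFrom.LATEST)
first_a = topic.deliver("A")
first_b = topic.deliver("B")
topic.publish("msg5")
print(first_a, first_b, topic.deliver("A"), topic.deliver("B"))
```

With a bound of 3, msg1 has already been dropped by the time A subscribes, so A starts from msg2 while B only sees msg5.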

I saw Flavio's reply to Yannick, which suggests using ZooKeeper to coordinate the actions of publisher and subscriber. But that is a client-side solution; I would prefer solution 1) in Sijie's proposal, which requires no special work on the client side.

Thanks,
Jiannan


From: Sijie Guo <gu...@gmail.com>
Reply-To: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>
Date: Thursday, February 21, 2013 4:50 PM
To: "bookkeeper-dev@zookeeper.apache.org" <bo...@zookeeper.apache.org>
Cc: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>, Hang Qi <ha...@yahoo-inc.com>, Hongjian Chen <ho...@yahoo-inc.com>, Bizhu Qiu <qi...@yahoo-inc.com>, Fangmin Lv <lv...@yahoo-inc.com>, Lin Shen <sh...@yahoo-inc.com>
Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Thanks Jiannan for raising the discussion of queue semantic. There was some other guys in the mail list asked for queue semantic before.

Basically, topic (pub/sub) is quite different from queue in messaging concepts. In pub/sub model, when a publisher publish a message, it goes to all the consumers (subscribers) who are interested; while a queue model implements a load balancer semantic. A single message would be consumed almost exactly by one consumer. It means that a queue has many consumers with messages load balanced across the available consumers.

If the application requires all consumers seen same view of published messages, a topic is better for it. If the application doesn't matter who would receive and consume the published messages, a queue is better. But these two concepts become similar when there are only one consumer. It might make you confused on using a queue or a topic.

for your case, it is still a pub/sub application. so your first question is how to handle this case gracefully in a pub/sub model. two ideas could be proceed to resolve it (similar as what kafka did):

1) have a subscription option to indicate subscribe starting from the latest sequence id or the oldest sequence id.

2) let subscriber managed its consumed ptr and passed the consumed ptr back when subscribe to tell hub server where to start delivery. this subscriber could be a special subscriber distinguished by a subscription option.

several benefits could be made by 2):

a) eliminate the storage and access of subscription metadata.
b) provided the mechanism to rewind the subscription back for replaying already consumed messages again.

for the garbage collection stuff you mentioned on how long to keep the messages, we already have messageBound to limit the length of a topic. We don't need to worry about it.

for your second question, it might be nice to have the queue semantic in Hedwig, since JMS implementation needs it. But implementing the queue semantic is totally a different story than pub/sub.

-Sijie


On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <ji...@yahoo-inc.com> wrote:
Hi guys,
   Under current Hedwig semantic, a subscriber cannot aware of messages published before he subscribes the topic. So in following example, subscriber A can only receives messages after seqId 2.
---------------------------------
Topic T: msg1 msg2 msg3 msg4 ...
                     | <- subscriber A subscribe the topic
---------------------------------

   This semantic is very reasonable, but Hedwig client needs to handle this corner case: a new topic is just to be created, and as topic is lazily created by the first request (generally it's PUB or SUB), so the client side must coordinate between publisher and subscriber to make sure the first SUB is handled before the first PUB at this very beginning status (consider subscriber may have very bad network connection which causes SUB failed and user does not want to miss any messages). In summary, it requires special works if there is a subscriber would like to receive all the messages since topic is created, and I think this requirement is very general.

   Handle this problem in client side is a choice, but I think maybe we can simply resolve it  in server side if Hedwig can support queue semantic (so that we can also extend Hedwig JMS provider to support JMS queue in BOOKKEEPER-312). And as I known, the major concern on queue semantic is how long to keep the messages, however:
   1. It is user's responsibility to know about the feature and impact of queue semantic.
   2. On the other hand, we can add a parameter to limit the queue length.

   In a word, here are the two problem I would like to discuss:
   1. How to gracefully resolve the above issue in server side under current semantic.
   2. Whether or not to introduce queue semantic into Hedwig.

Thanks,
Jiannan


Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Sijie Guo <gu...@gmail.com>.
Thanks, Jiannan, for raising the discussion of queue semantics. Some other
people on the mailing list have asked for queue semantics before.

Basically, a topic (pub/sub) is quite different from a queue in messaging
concepts. In the pub/sub model, when a publisher publishes a message, it
goes to all the consumers (subscribers) who are interested, while a queue
model implements load-balancing semantics: a single message is consumed
by almost exactly one consumer. That is, a queue has many consumers,
with messages load balanced across the available consumers.
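
The difference between the two delivery models can be sketched as follows. This is a toy illustration only: deliver_topic and deliver_queue are hypothetical helpers, and round-robin is just one assumed way to balance load.

```python
from itertools import cycle


def deliver_topic(messages, subscribers):
    """Pub/sub: every subscriber receives every message."""
    return {s: list(messages) for s in subscribers}


def deliver_queue(messages, consumers):
    """Queue: each message goes to one consumer (round-robin here)."""
    inbox = {c: [] for c in consumers}
    for msg, consumer in zip(messages, cycle(consumers)):
        inbox[consumer].append(msg)
    return inbox


msgs = ["msg1", "msg2", "msg3", "msg4"]
topic_view = deliver_topic(msgs, ["A", "B"])
queue_view = deliver_queue(msgs, ["A", "B"])
print(topic_view)
print(queue_view)
```

With a single consumer both functions return the same result, which is why the two models are easy to confuse in that case.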

If the application requires all consumers to see the same view of published
messages, a topic is the better fit. If the application does not care which
consumer receives and consumes the published messages, a queue is better.
But the two concepts become similar when there is only one consumer, which
can make it confusing whether to use a queue or a topic.

For your case, it is still a pub/sub application, so your first question is
how to handle this case gracefully in a pub/sub model. Two ideas could
resolve it (similar to what Kafka did):

1) Have a subscription option to indicate whether the subscription starts
from the latest sequence id or the oldest sequence id.

2) Let the subscriber manage its consumed pointer and pass it back when
subscribing, to tell the hub server where to start delivery. Such a
subscriber could be a special subscriber distinguished by a subscription
option.

Idea 2) would bring several benefits:

a) Eliminate the storage and access of subscription metadata.
b) Provide a mechanism to rewind the subscription and replay already
consumed messages.
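
Idea 2) can be sketched roughly as below. This is a minimal in-memory model under stated assumptions: TopicLog, SelfTrackingSubscriber, read_after and rewind are hypothetical names, not Hedwig APIs. The server side keeps no per-subscriber state; the subscriber passes its own consumed sequence id on every read.

```python
class TopicLog:
    """Toy topic log; sequence ids start at 1."""

    def __init__(self):
        self._messages = []

    def publish(self, payload):
        self._messages.append(payload)
        return len(self._messages)  # sequence id of the new message

    def read_after(self, consumed_seq_id):
        """Return (seq_id, payload) pairs with seq_id > consumed_seq_id."""
        return [
            (i + 1, m)
            for i, m in enumerate(self._messages)
            if i + 1 > consumed_seq_id
        ]


class SelfTrackingSubscriber:
    """Subscriber that owns its consumed pointer."""

    def __init__(self):
        self.consumed_seq_id = 0  # 0 means nothing consumed yet

    def poll(self, topic):
        batch = topic.read_after(self.consumed_seq_id)
        if batch:
            self.consumed_seq_id = batch[-1][0]
        return [payload for _, payload in batch]

    def rewind(self, seq_id):
        # Move the pointer back so later messages are delivered again.
        self.consumed_seq_id = seq_id


topic = TopicLog()
for i in range(1, 4):
    topic.publish(f"msg{i}")  # published before the subscriber shows up

sub = SelfTrackingSubscriber()
first = sub.poll(topic)
sub.rewind(1)
replayed = sub.poll(topic)
print(first, replayed)
```

Because the pointer lives with the subscriber, a late subscriber can still start from sequence id 0, and rewinding is just a matter of passing an older pointer.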

As for the garbage collection concern you mentioned about how long to keep
the messages, we already have messageBound to limit the length of a topic,
so we don't need to worry about it.

For your second question, it might be nice to have queue semantics in
Hedwig, since the JMS implementation needs them. But implementing queue
semantics is a totally different story from pub/sub.

-Sijie


On Wed, Feb 20, 2013 at 6:58 PM, Jiannan Wang <ji...@yahoo-inc.com> wrote:

> Hi guys,
>    Under current Hedwig semantic, a subscriber cannot aware of messages
> published before he subscribes the topic. So in following example,
> subscriber A can only receives messages after seqId 2.
> ---------------------------------
> Topic T: msg1 msg2 msg3 msg4 ...
>                      | <- subscriber A subscribe the topic
> ---------------------------------
>
>    This semantic is very reasonable, but Hedwig client needs to handle
> this corner case: a new topic is just to be created, and as topic is lazily
> created by the first request (generally it's PUB or SUB), so the client
> side must coordinate between publisher and subscriber to make sure the
> first SUB is handled before the first PUB at this very beginning status
> (consider subscriber may have very bad network connection which causes SUB
> failed and user does not want to miss any messages). In summary, it
> requires special works if there is a subscriber would like to receive all
> the messages since topic is created, and I think this requirement is very
> general.
>
>    Handle this problem in client side is a choice, but I think maybe we
> can simply resolve it  in server side if Hedwig can support queue semantic
> (so that we can also extend Hedwig JMS provider to support JMS queue in
> BOOKKEEPER-312). And as I known, the major concern on queue semantic is how
> long to keep the messages, however:
>    1. It is user's responsibility to know about the feature and impact of
> queue semantic.
>    2. On the other hand, we can add a parameter to limit the queue length.
>
>    In a word, here are the two problem I would like to discuss:
>    1. How to gracefully resolve the above issue in server side under
> current semantic.
>    2. Whether or not to introduce queue semantic into Hedwig.
>
> Thanks,
> Jiannan
>

Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Jiannan Wang <ji...@yahoo-inc.com>.
Thanks, Yannick, for your reply. I went through the mailing list and found that you had mentioned the problem before.

The two points you raise are interesting:
   - I think message "life time" is a separate topic: our problem is with the consume pointer, not with message lifetime (even if a message is still live, a subscriber that subscribes after it still cannot receive it). But it would be awesome if Hedwig could support this feature.
   - Message "minimum consumption number": I guess you mean "minimum number of consumptions by different subscribers", which would change the current message garbage collection strategy. It could resolve the problem, but I don't know whether the feature is worth adding. Is there any other usage scenario?

- Jiannan


From: Yannick Legros <ya...@gmail.com>
Reply-To: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>
Date: Thursday, February 21, 2013 4:25 PM
To: "bookkeeper-user@zookeeper.apache.org" <bo...@zookeeper.apache.org>
Subject: Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Hi Jiannan

I already face to this problem with Hedwig and i told this on this mailing list few weeks ago.
And yes the main problem is too coordinate publisher and subscriber to make sure the first SUB is handled before the first PUB.

just to add my contribution to this reflection :

1.  I think about some "life time" or "minumum consumation number" parameters for messages.

Regards,
Yannick.




2013/2/21 Jiannan Wang <ji...@yahoo-inc.com>
Hi guys,
   Under current Hedwig semantic, a subscriber cannot aware of messages published before he subscribes the topic. So in following example, subscriber A can only receives messages after seqId 2.
---------------------------------
Topic T: msg1 msg2 msg3 msg4 ...
                     | <- subscriber A subscribe the topic
---------------------------------

   This semantic is very reasonable, but Hedwig client needs to handle this corner case: a new topic is just to be created, and as topic is lazily created by the first request (generally it's PUB or SUB), so the client side must coordinate between publisher and subscriber to make sure the first SUB is handled before the first PUB at this very beginning status (consider subscriber may have very bad network connection which causes SUB failed and user does not want to miss any messages). In summary, it requires special works if there is a subscriber would like to receive all the messages since topic is created, and I think this requirement is very general.

   Handle this problem in client side is a choice, but I think maybe we can simply resolve it in server side if Hedwig can support queue semantic (so that we can also extend Hedwig JMS provider to support JMS queue in BOOKKEEPER-312). And as I known, the major concern on queue semantic is how long to keep the messages, however:
   1. It is user's responsibility to know about the feature and impact of queue semantic.
   2. On the other hand, we can add a parameter to limit the queue length.

   In a word, here are the two problem I would like to discuss:
   1. How to gracefully resolve the above issue in server side under current semantic.
   2. Whether or not to introduce queue semantic into Hedwig.

Thanks,
Jiannan


Re: [Discussion] [Hedwig] Add queue semantic support for Hedwig

Posted by Yannick Legros <ya...@gmail.com>.
Hi Jiannan

I have already faced this problem with Hedwig, and I mentioned it on this
mailing list a few weeks ago.
And yes, the main problem is to coordinate publisher and subscriber to make
sure the first SUB is handled before the first PUB.

Just to add my contribution to this discussion:

1.  I am thinking about some "lifetime" or "minimum consumption number"
parameters for messages.

Regards,
Yannick.



