Posted to dev@rocketmq.apache.org by 李 德鑫 <de...@outlook.com> on 2018/03/08 00:31:56 UTC

Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Hi Sohaib,


I'm a student applying for GSoC too, and I've read all of your discussion on the mailing list.

I have some questions about your design, and some of them may need to be answered by the RocketMQ team, so I am sending them here for discussion.

I don't think using a key store to persist all the messages is a good idea. An MQ is built on O(1) data structures, so a key store lookup for every message would hurt performance.

I think we can learn from the TCP protocol.

In producer-broker communication, we can assign an incremental ID to every message sent in the same session, and the producer should persist its session ID on disk. The broker then only needs to maintain a map from session ID to expected message ID (this is how Kafka does it), which is much smaller because there are far more messages than producers. However, a K/V store is still needed, so we have to ask the RocketMQ team how many producers are typically connected at the same time in practice.

The same idea applies to consumer-broker communication.
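
A minimal sketch of the session/sequence idea, under stated assumptions: the class name `SessionSequenceDeduplicator`, the method `accept`, and the choice to reject gaps with an exception are all hypothetical and are not part of RocketMQ or Kafka. The broker keeps one expected sequence number per producer session instead of one entry per message.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical broker-side check: one expected sequence number per producer session. */
final class SessionSequenceDeduplicator {
    // sessionId -> next sequence number the broker expects from that session
    private final Map<String, Long> expected = new HashMap<>();

    /**
     * Returns true if the message is new and should be stored, false if it is a
     * retransmission of a sequence number that was already accepted. A gap
     * (seq greater than expected) is rejected so the producer can resend in order.
     */
    synchronized boolean accept(String sessionId, long seq) {
        long next = expected.getOrDefault(sessionId, 0L);
        if (seq < next) {
            return false; // duplicate: already stored
        }
        if (seq > next) {
            throw new IllegalStateException("expected " + next + " but got " + seq);
        }
        expected.put(sessionId, next + 1);
        return true;
    }

    public static void main(String[] args) {
        SessionSequenceDeduplicator d = new SessionSequenceDeduplicator();
        System.out.println(d.accept("producer-A", 0)); // first send
        System.out.println(d.accept("producer-A", 1)); // next in order
        System.out.println(d.accept("producer-A", 1)); // retry of the same message
    }
}
```

The map grows with the number of sessions rather than the number of messages, which is the point of the proposal; it would still have to be persisted or replicated to survive a broker restart.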


About the consensus algorithm: I think RocketMQ should already have an implementation. I don't know what it is, but maybe you can reuse it. If you do have to implement one, in my opinion there's no need to implement both Paxos and Raft, since they solve the same kind of problem.



Regards,

Dexin


________________________________
From: Sohaib Iftikhar <so...@gmail.com>
Sent: March 7, 2018 18:15:51
To: dev@rocketmq.apache.org
Subject: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Hi Yukon,

Thanks for your reply. Yes, it would be nice to concretely define the scope
of this project as the doc is a bit ambitious for just a summer. Should you
(or anyone else) have questions/suggestions/clarifications I'd be glad to
discuss more details.

Thanks,
Sohaib

On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:

> Hi,
>
> Google Docs is better for discussion. Your design is great; now we can
> discuss more details based on it.
>
> Any advice from the RocketMQ community is welcome.
>
> Appreciate your efforts.
>
> Regards,
> yukon
>
> On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <so...@gmail.com>
> wrote:
>
> > Hi,
> >
> > @Yukon Thank you for your reply. This clears some doubts.
> >
> > Sorry for the delay; I was somewhat occupied with another project. I have
> > created an initial design doc. Since email is a bit cumbersome for
> > feedback, I wrote this document in two formats:
> >
> > 1. In the form of a Google document:
> > https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24
> > The document is open for comments to all users without signing in. I would
> > appreciate it if you put your name before the comment so I can identify
> > who to follow up the discussion with.
> >
> > 2. As a Markdown file on GitHub:
> > https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md
> > Comments on this version can be made on the commit:
> > https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
> >
> > After I have received a certain amount of feedback, I will try to
> > incorporate it and put up a subsequent version for review. Please tell me
> > which method suits you better (Google Doc or GitHub) for review, and we
> > can continue with that for the subsequent versions.
> >
> > Lastly, the document is a couple of pages long, so I appreciate your
> > patience and your help.
> > Looking forward to your opinions.
> >
> > Thanks,
> > Sohaib
> >
> > On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
> >
> > > Hi Sohaib,
> > >
> > > Sorry for the late reply. We can move this project forward now ~
> > >
> > > ```
> > > I would at some point like to post
> > > design ideas to this problem privately to get it reviewed by the
> > > development community but not make it publicly available so that it
> > > cannot be plagiarised.
> > > ```
> > >
> > > You can send your design ideas to me directly or to our PMC list
> > > (private@rocketmq.apache.org) if you want to keep your ideas private.
> > > But please don't break away from the community.
> > >
> > > I hope you have already understood the goal of this project. Currently,
> > > RocketMQ supports at-least-once delivery, so an obvious solution is to
> > > achieve exactly-once by removing duplicated messages.
> > >
> > > Returning to your original questions:
> > >
> > > 1. What defines a redundant message?
> > >
> > > A message ID is generated when a new message is created, so this ID can
> > > be used to identify a message. The user can also specify a unique
> > > business-related property to identify a message.
> > >
> > > Redundant messages occur when the network is broken or reconnected, when
> > > a rebalance[1] is triggered, etc.
> > >
> > >
> > > 2. Is there a timeline on the redundant messages?
> > >
> > > Yes. Keeping all messages non-redundant is expensive, so let's consider
> > > this question within a certain time window ~
> > >
> > > Looking forward to your design.
> > >
> > > [1].
> > > https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
> > >
> > >
> > > Regards,
> > > yukon
> > >
> > >
> > > On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <so...@gmail.com>
> > > wrote:
> > >
> > > > @Zhanhui Thanks for the response. This is not a campaign; it's just
> > > > part of GSoC (https://summerofcode.withgoogle.com/). Community help is
> > > > gladly welcomed. In fact, it is recommended :)
> > > >
> > > > @KaiYuan Thanks for your suggestions. I will come up with a flowchart
> > > > for the proposed solution this weekend.
> > > >
> > > > Thanks,
> > > > Sohaib
> > > >
> > > >
> > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <li...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi Sohaib,
> > > > >
> > > > > I have been rather busy these days. Sorry for replying so late!
> > > > >
> > > > > I'm not sure what “deadline” you are referring to. If this is part
> > > > > of a campaign, I have to admit I am not aware of the regulations or
> > > > > of what kind of help I should offer while maintaining fairness,
> > > > > considering that other similar issues may arise.
> > > > >
> > > > > Regards!
> > > > >
> > > > > Zhanhui Li
> > > > >
> > > > >
> > > > > > On March 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
> > > > > >
> > > > > > Hi guys,
> > > > > >
> > > > > > It would be nice to have some feedback on this, as the deadline
> > > > > > is not too far off :)
> > > > > >
> > > > > > Thanks,
> > > > > > Sohaib
> > > > > >
> > > > > > Regards,
> > > > > > Sohaib Iftikhar
> > > > > >
> > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > Thank you for the pointers to the code. This was super helpful.
> > > > > > The multiple keys could probably be serialized better than by
> > > > > > separating them with a space, but that is already legacy, I suppose.
> > > > > >
> > > > > > Firstly, filters like Bloom or cuckoo filters are probabilistic:
> > > > > > they can report false positives. They can help make things faster
> > > > > > but definitely cannot be used as the only solution. Hence, in the
> > > > > > end, we will still need a persistent key store/distributed set. My
> > > > > > plan was to make this key store distributed (Raft guarantees, etc.).
> > > > > > The key store can also hold a persistent filter on its end. If a
> > > > > > broker crashes, it can renew/refresh its filter from the key store,
> > > > > > which eliminates the crash problems that you mention. The problem
> > > > > > here could be maintaining filter performance when entries are
> > > > > > removed from the key store (e.g. sliding windows, as mentioned in my
> > > > > > previous mail). Periodically refreshing the filters can help solve
> > > > > > this, but I am open to suggestions on how to make this better.
> > > > > >
> > > > > > I think implementing a distributed set on the client cluster has
> > > > > > its caveats. The way I understand RocketMQ, we do not have control
> > > > > > over the disk space/memory on the client end, so we probably only
> > > > > > have a constant amount. A distributed set on the client would also
> > > > > > need to be persistent, e.g. to survive a client restart or recovery.
> > > > > > This basically means we need a key store on the client instead of
> > > > > > the broker cluster, which probably puts too much responsibility on
> > > > > > the client cluster. A different approach would be to ensure that the
> > > > > > offsets are always in sync with the broker. Since the broker only
> > > > > > serves unique messages (based on the proposed solution on the
> > > > > > producer/broker end), all we need to ensure is that a client does
> > > > > > not consume messages with the same offset twice.
> > > > > >
> > > > > > Please suggest improvements if this does not look like the correct
> > > > > > approach. It would also be great if someone could come up with a
> > > > > > completely different approach so that we can weigh up the pros and
> > > > > > cons.
> > > > > >
> > > > > > Thanks for reading this through; I look forward to your opinions.
> > > > > >
> > > > > > Regards,
> > > > > > Sohaib
> > > > > >
> > > > > > Regards,
> > > > > > Sohaib Iftikhar
> > > > > >
> > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > >
> > > > > >
> > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
> > > > > > Hi Sohaib,
> > > > > >
> > > > > > About multiple key support, the following should clarify your
> > > > > > doubt:
> > > > > > The org.apache.rocketmq.common.message.Message class has overloaded
> > > > > > setKeys methods, allowing you to set multiple keys via a string
> > > > > > (separated by spaces; sorry, we have not yet unified all separators,
> > > > > > I hope this does not confuse you) or a collection.
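
A standalone sketch of why a space separator is fragile, assuming only what is described above (keys joined into one string with a single space). It does not call any RocketMQ class; `SpaceSeparatedKeys`, `join` and `split` are hypothetical names used for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a space-separated multi-key encoding. */
final class SpaceSeparatedKeys {
    static final String SEPARATOR = " "; // assumed separator, as described in the mail

    /** Encodes several keys into one string. */
    static String join(List<String> keys) {
        return String.join(SEPARATOR, keys);
    }

    /** Decodes the string back into keys by splitting on the separator. */
    static List<String> split(String joined) {
        return Arrays.asList(joined.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // Keys without spaces survive the round trip.
        System.out.println(split(join(Arrays.asList("order-1", "user-42"))));
        // A key that itself contains a space is split into two keys.
        System.out.println(split(join(Arrays.asList("order 1", "user-42"))));
    }
}
```

Two keys go in, but when one of them contains the separator, three come out, so any key-based lookup or deduplication would need keys that never contain a space (or an escaping rule).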
> > > > > >
> > > > > >
> > > > > > When the broker builds the index for a message with multiple keys,
> > > > > > multiple index entries are inserted into the index file.
> > > > > > See org.apache.rocketmq.store.index.IndexService#buildIndex
> > > > > >
> > > > > >
> > > > > > In terms of eliminating message duplication, personally, I wish we
> > > > > > had an exactly-once semantic covering the whole cluster and the
> > > > > > complete send-store-consume process. A rough idea is to route each
> > > > > > message to a broker according to its unique key, following a fixed
> > > > > > rule; the serving broker then ensures uniqueness of the message
> > > > > > according to the key (as you said, with a Bloom filter, cuckoo
> > > > > > filter, etc.). Things might look simple, but issues arise in
> > > > > > scenarios where the cluster is experiencing membership changes. For
> > > > > > example, what if a broker crashes? We might need to propagate the
> > > > > > Bloom filter bitset synchronously to other brokers serving the same
> > > > > > topics. What if a new broker joins the cluster and starts to serve?
> > > > > > I do not mean this is too complex to implement. Instead, this is a
> > > > > > pretty interesting topic and a fancy feature to have. Alternatively,
> > > > > > we might defer eliminating duplicates to the consumption phase using
> > > > > > some kind of distributed set. For sure, the idea I am proposing
> > > > > > suffers from the same challenges, including membership changes.
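
A rough sketch of the routing rule and of the membership-change problem, under stated assumptions: the rule used here (hash of the key modulo the broker count) is only one possible choice, and `KeyRouter`, `route`, `countMoved` and the broker names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical key-based routing rule: hash of the unique key modulo the broker count. */
final class KeyRouter {
    /** Picks the broker responsible for deduplicating the given key. */
    static String route(String key, List<String> brokers) {
        return brokers.get(Math.floorMod(key.hashCode(), brokers.size()));
    }

    /** Counts how many of the keys k0..k(n-1) change broker when membership changes. */
    static int countMoved(int n, List<String> before, List<String> after) {
        int moved = 0;
        for (int i = 0; i < n; i++) {
            String key = "k" + i;
            if (!route(key, before).equals(route(key, after))) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> three = Arrays.asList("broker-a", "broker-b", "broker-c");
        List<String> four = Arrays.asList("broker-a", "broker-b", "broker-c", "broker-d");
        System.out.println("order-1 -> " + route("order-1", three));
        System.out.println(countMoved(1000, three, four) + " of 1000 keys moved");
    }
}
```

With plain modulo routing, adding one broker reassigns a large share of keys to brokers whose filters have never seen them, which is why the filter state would have to be propagated or rebuilt when membership changes. A consistent-hashing rule would move fewer keys but does not remove the problem.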
> > > > > >
> > > > > > Guys of dev board, any insights on this issue?
> > > > > >
> > > > > > Zhanhui Li
> > > > > >
> > > > > >
> > > > > >> On February 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > >>
> > > > > >> Hi Zhanhui,
> > > > > >>
> > > > > >> I have a doubt about these multiple keys. If I am wrong in any of
> > > > > >> the assumptions I make, please point it out.
> > > > > >>
> > > > > >> If there is support for multiple keys, I cannot see it in the
> > > > > >> code. The class Message only stores a single key in the property
> > > > > >> map against the property name "KEYS". Is this done in the same
> > > > > >> way as tags, that is, different keys are separated with ' || '?
> > > > > >> So basically, it is the responsibility of the user of the producer
> > > > > >> API to separate the different keys with the correct separator. I
> > > > > >> can see an obvious problem here: what if a key contains this
> > > > > >> special sequence ' || '? But maybe this event is rare and hence
> > > > > >> not important. Could you point me to some source/doc that explains
> > > > > >> this part? I was looking at the index section of rocketmq-store,
> > > > > >> but I have not been able to understand the indexing process
> > > > > >> completely for now. I will keep reading the source to get a better
> > > > > >> idea.
> > > > > >>
> > > > > >> Moving on to the implementation details, here is a broad idea of
> > > > > >> one possible way to approach it.
> > > > > >>
> > > > > >> The attempt is to remove duplicate messages. In this issue, I
> > > > > >> would like to aim at eliminating duplicate messages at the
> > > > > >> producer/broker end. For now, we do not concern ourselves with
> > > > > >> the duplicate messages caused by unwritten consumer offsets, as
> > > > > >> these two issues have different solutions.
> > > > > >> One way to solve this problem at the producer/broker end could be
> > > > > >> to have a distributed key store that stores the messages. We can
> > > > > >> make it configurable such that this distributed store either
> > > > > >> stores all messages or works as a sliding window, keeping only the
> > > > > >> messages from the last X seconds specified by the user. We can
> > > > > >> have a layer on top to check set membership, such as a Bloom
> > > > > >> filter or a cuckoo filter
> > > > > >> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf), to
> > > > > >> help performance. Every message pushed in by a producer is first
> > > > > >> checked against the filter and, in case of a positive result,
> > > > > >> against this key store. If the message is found, it is discarded.
> > > > > >> This helps remove duplicates completely from a producer
> > > > > >> perspective. The core of this idea is the distributed key store,
> > > > > >> which would be completely separate from the current message
> > > > > >> storage. Since the concept of a distributed key store or a
> > > > > >> key/value store is not novel, there are two ways to do this.
> > > > > >> 1. Implement it ourselves. This would be high effort but would
> > > > > >> have no external dependencies.
> > > > > >> 2. Use a key-value store such as Redis (which already has timeouts
> > > > > >> and persistence but a large memory footprint) or some other
> > > > > >> disk-based storage for set membership. This would add an external
> > > > > >> dependency, but development time would be reduced significantly.
> > > > > >> I am inclined towards implementing it myself, as this would avoid
> > > > > >> dependencies on other products, especially since RocketMQ is
> > > > > >> currently a self-reliant system. In addition, my past experience
> > > > > >> with building such a store should also come in handy.
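
The filter-then-store check with a sliding window can be sketched in a single process as follows. This is only an illustration under stated assumptions: an in-memory map stands in for the distributed key store, the two-hash Bloom filter over a `BitSet` is deliberately simplistic, the filter is rebuilt after expiry because a plain Bloom filter cannot delete entries, and timestamps are assumed to be non-decreasing. `WindowedDeduplicator` and `offer` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.BitSet;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-node dedup check: Bloom filter in front of a windowed key store. */
final class WindowedDeduplicator {
    private static final int BITS = 1 << 16;
    private final long windowMillis;
    private BitSet filter = new BitSet(BITS);                // fast "possibly seen" check
    private final Map<String, Long> store = new HashMap<>(); // key -> time first seen
    private final Deque<String> order = new ArrayDeque<>();  // keys in arrival order

    WindowedDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the key was not seen within the window and is now recorded. */
    boolean offer(String key, long nowMillis) {
        expire(nowMillis);
        // Only a positive filter answer needs the (more expensive) store lookup.
        if (mightContain(key) && store.containsKey(key)) {
            return false;
        }
        store.put(key, nowMillis);
        order.addLast(key);
        for (int p : positions(key)) filter.set(p);
        return true;
    }

    private boolean mightContain(String key) {
        for (int p : positions(key)) {
            if (!filter.get(p)) return false;
        }
        return true;
    }

    private static int[] positions(String key) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 13);
        return new int[] {Math.floorMod(h1, BITS), Math.floorMod(h2, BITS)};
    }

    /** Drops keys older than the window and rebuilds the filter if anything was dropped. */
    private void expire(long nowMillis) {
        boolean removed = false;
        while (!order.isEmpty() && store.get(order.peekFirst()) + windowMillis <= nowMillis) {
            store.remove(order.pollFirst());
            removed = true;
        }
        if (removed) {
            filter = new BitSet(BITS);
            for (String k : store.keySet()) {
                for (int p : positions(k)) filter.set(p);
            }
        }
    }
}
```

For example, with a 1000 ms window, `offer("m1", 0)` accepts the key, `offer("m1", 500)` rejects it as a duplicate, and `offer("m1", 1000)` accepts it again because the first entry has expired. Rebuilding the filter on every expiry is the simplest option; a cuckoo filter, which supports deletion, would avoid the rebuild.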
> > > > > >>
> > > > > >> I would like to know the opinions of the development community on
> > > > > >> this approach and to hear suggested improvements. Looking forward
> > > > > >> to your responses.
> > > > > >>
> > > > > >> ====<question unrelated to issue>=====
> > > > > >> To increase my familiarity with the code base and to help prove
> > > > > >> that I am familiar with the tools and technologies in place, it
> > > > > >> would be great if I could be pointed to some low-effort issues
> > > > > >> that I could help out with. In case there are no 'newbie' issues
> > > > > >> available, I could help improve the comments inside the codebase.
> > > > > >> I noticed some source files with no explanations, which could be
> > > > > >> documented via comments to help onboard new contributors faster.
> > > > > >> ====</question unrelated to issue>=====
> > > > > >>
> > > > > >> Thanks a lot for reading this through and looking forward to your
> > > > > >> opinions.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Sohaib
> > > > > >>
> > > > > >>
> > > > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
> > > > > >>
> > > > > >>> Hi Sohaib,
> > > > > >>>
> > > > > >>> Happy to know you are interested in RocketMQ.
> > > > > >>>
> > > > > >>> First, let me answer questions you raised.
> > > > > >>>
> > > > > >>> - can there be multiple tags?
> > > > > >>> No. At present, the storage engine allows a single tag only.
> > > > > >>> Subscriptions are allowed to use a combination of tags. The
> > > > > >>> current model should meet your business needs. If not, please
> > > > > >>> let us know.
> > > > > >>>
> > > > > >>>
> > > > > >>> - key (Similar question to above.)
> > > > > >>> RocketMQ builds its index using message keys. A single message
> > > > > >>> may have multiple keys.
> > > > > >>>
> > > > > >>> - About redundant message
> > > > > >>> From my understanding, you are trying to eliminate duplicate
> > > > > >>> messages. True, there are various causes of message duplication,
> > > > > >>> ranging from message delivery to consumption. Discussion on this
> > > > > >>> topic is warmly welcome. If you have any ideas to contribute on
> > > > > >>> this issue, the developer board is happy to discuss them.
> > > > > >>>
> > > > > >>> Zhanhui Li
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>> On February 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > >>>>
> > > > > >>>> My earlier email message seems to have gotten lost, so I will try
> > > > > >>>> again. Please see the original message for the discussion.
> > > > > >>>>
> > > > > >>>> Regards,
> > > > > >>>> Sohaib Iftikhar
> > > > > >>>>
> > > > > >>>> -- Man is still the most extraordinary computer of all.--
> > > > > >>>>
> > > > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > >>>>
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> I am interested in working on this issue
> > > > > >>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part of
> > > > > >>>>> GSoC 2018. I have a few questions about it. I am not sure if
> > > > > >>>>> this discussion needs to be on the JIRA issue or here. Feel
> > > > > >>>>> free to correct me if this is the wrong platform. Also, while I
> > > > > >>>>> have worked with distributed pub-sub systems, I am still fairly
> > > > > >>>>> new to RocketMQ, so maybe my understanding of it is incorrect.
> > > > > >>>>> I apologise if that is the case and would be happy to stand
> > > > > >>>>> corrected.
> > > > > >>>>>
> > > > > >>>>> Following are my questions:
> > > > > >>>>> 1. What defines a redundant message?
> > > > > >>>>>   The constructor that I see for a message is as follows:
> > > > > >>>>>   Message(String topic, String tags, String keys, int flag,
> > > > > >>>>>   byte[] body, boolean waitStoreMsgOK)
> > > > > >>>>>   Possible candidates to me are topic, tags (can there be
> > > > > >>>>>   multiple tags? I could not find an example for this. If yes,
> > > > > >>>>>   how are they separated?), keys (similar question to above)
> > > > > >>>>>   and of course the body. Is there something that I have missed
> > > > > >>>>>   in this? Is there something that we do not need to consider?
> > > > > >>>>> 2. Is there a timeline on the redundant messages? What I mean
> > > > > >>>>> by this is: is there a time limit after which a message with
> > > > > >>>>> similar content is allowed? From what I gather, no such thing
> > > > > >>>>> was mentioned. This would mean storing all the messages.
> > > > > >>>>> Depending on the requirements, this may or may not be the best
> > > > > >>>>> solution. It might be desirable that no duplicates occur within
> > > > > >>>>> a certain (sliding) time window. This allows ignoring duplicate
> > > > > >>>>> messages that were generated very close to each other (or
> > > > > >>>>> within the window indicated). Depending on this requirement,
> > > > > >>>>> the implementation may become a little bit more involved.
> > > > > >>>>>
> > > > > >>>>> For now, these are the only questions. I have ideas about
> > > > > >>>>> possible implementations that need review, but I will mention
> > > > > >>>>> them once the specifications are clear to me. As a final
> > > > > >>>>> question, I would at some point like to post design ideas to
> > > > > >>>>> this problem privately to get it reviewed by the development
> > > > > >>>>> community but not make it publicly available so that it cannot
> > > > > >>>>> be plagiarised. What platform/method can I use to do that? Or
> > > > > >>>>> is submitting a draft to the Google platform the only possible
> > > > > >>>>> way to accomplish this?
> > > > > >>>>>
> > > > > >>>>> Thanks a lot for reading this through and looking forward to
> > > > > >>>>> your inputs.
> > > > > >>>>>
> > > > > >>>>> Regards,
> > > > > >>>>> Sohaib Iftikhar
> > > > > >>>>>
> > > > > >>>
> > > > > >>>
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Posted by Sohaib Iftikhar <so...@gmail.com>.
Hi Von,

Thank you for your suggestions. I have answered your queries on the doc.
After your replies, I will create a formal proposal as per Google guidelines
<https://google.github.io/gsocguides/student/writing-a-proposal> for review
as the submission period has now started.

Thanks,
Sohaib

On Mon, Mar 12, 2018 at 9:48 AM, Von Gosling <vo...@apache.org> wrote:

> Hi Sohaib,
>
> I have reviewed your doc and made some suggestions regarding your concerns.
>
> To all other GSoC students: could you follow the same practice as Sohaib?
> We look forward to your proposals on Google Docs.
>
> Best Regards,
> Von Gosling
>
> > On March 9, 2018, at 15:17, Sohaib Iftikhar <so...@gmail.com> wrote:
> >
> > Hi Yukon,
> >
> > What do you suggest for the key store itself? Do you propose writing this
> > ourselves or using some existing solution and writing a layer on top?
> >
> > Thanks,
> > Sohaib
> >
> > On Fri, Mar 9, 2018 at 6:20 AM, yukon <yu...@apache.org> wrote:
> >
> >> ```
> >> Personally, I find RAFT to be much simpler to implement. However, I do
> >> not expect to reinvent the wheel here.
> >> ```
> >>
> >> That's absolutely right, there is no need to reinvent the wheel. There are
> >> many existing implementations of Raft: https://raft.github.io/
> >>
> >> ```
> >> I don't think using key store to persist all the messages is a good
> >> idea.
> >> ```
> >>
> >> Yes, storing an ID is enough.
> >>
> >>
> >> On Thu, Mar 8, 2018 at 3:32 PM, Sohaib Iftikhar <so...@gmail.com>
> >> wrote:
> >>
> >>> Hi Dexin,
> >>>
> >>> Thank you for your suggestions. I will try to answer as much as I can
> >>> and leave the rest to the RocketMQ team.
> >>>
> >>> 1. The idea with incremental IDs is actually quite good. But @Yukon
> >>> mentioned that duplication can also be controlled by the application
> >>> (a special KV property), in which case different producers may produce
> >>> the same message, which then needs to be deduplicated on the broker.
> >>> I believe SessionId+IncrementalId won't work in this scenario. But we
> >>> can switch to more efficient storage using the idea you described when
> >>> the user is not specifying these special keys.
> >>> Also, I proposed storing keys only for a fixed time interval. For all
> >>> practical purposes, lookups would still be effectively constant time.
> >>> [Log base 2 of 10^10 is still just about 33 :) ] It does add the extra
> >>> cost of communication, but this would be the case in both scenarios.
> >>> 2. As for consensus, the ideas I presented were pretty abstract, so I
> >>> mentioned a couple of algorithms that could potentially be used.
> >>> Personally, I find RAFT to be much simpler to implement. However, I do
> >>> not expect to reinvent the wheel here. I strongly believe that in this
> >>> case we can build upon some tested existing solution.
> >>>
> >>>
> >>> Regards,
> >>> Sohaib
> >>>
> >>> On Thu, Mar 8, 2018 at 1:31 AM, 李 德鑫 <de...@outlook.com> wrote:
> >>>
> >>>> Hi Sohaib,
> >>>>
> >>>>
> >>>> I‘m a student applying for GSOC too. And I've read all of your
> >> discussion
> >>>> in the mail list.
> >>>>
> >>>> I have some questions about your design, and some of the questions may
> >>>> need to be answered by RocketMQ team. So I send them here to be
> >>> discussed.
> >>>>
> >>>> I don't think using key store to persist all the messages is a good
> >> idea.
> >>>> Since MQ is based on O(1) data structure. The key store would harm the
> >>>> performance.
> >>>>
> >>>> I think we can learn from TCP protocol.
> >>>>
> >>>> In Producer-Broker Communication, we can give an incremental id for
> >> every
> >>>> message sent in the same session. And the session id should be
> >> persistent
> >>>> on the disk for producer. So the broker only need to maintain a map
> >>> between
> >>>> session id to expected message id(And this is how Kafka do it). Since
> >>>> messages are much more than producers. However, there's still a K/V
> >> store
> >>>> needed. So we have to ask RocketMQ team about how many producers in
> the
> >>>> same time while in practical situation.
> >>>>
> >>>> Also, the same idea in Consumer-Broker Communication.
> >>>>
> >>>>
> >>>> About consensus algorithm, I think RocketMQ should already have an
> >>>> implementation there. I don't know what it is, but maybe you can reuse
> >>>> that. Or what if you have to implement one, in my opinion, there's no
> >>> need
> >>>> to implement both Paxos and Raft. Since they solve the same kind of
> >>>> problems.
> >>>>
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> Dexin
> >>>>
> >>>>
> >>>> ________________________________
> >>>> 发件人: Sohaib Iftikhar <so...@gmail.com>
> >>>> 发送时间: 2018年3月7日 18:15:51
> >>>> 收件人: dev@rocketmq.apache.org
> >>>> 主题: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery
> >>>> mechanism
> >>>>
> >>>> Hi Yukon,
> >>>>
> >>>> Thanks for your reply. Yes, it would be nice to concretely define the
> >>> scope
> >>>> of this project as the doc is a bit ambitious for just a summer.
> Should
> >>> you
> >>>> (or anyone else) have questions/suggestions/clarifications I'd be
> glad
> >>> to
> >>>> discuss more details.
> >>>>
> >>>> Thanks,
> >>>> Sohaib
> >>>>
> >>>> On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> Google doc is better for discussion, your design is great, now we
> >> could
> >>>>> discuss more details base on it.
> >>>>>
> >>>>> Any advice is welcome from RocketMQ community.
> >>>>>
> >>>>> Appreciate your efforts.
> >>>>>
> >>>>> Regards,
> >>>>> yukon
> >>>>>
> >>>>> On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <
> >> sohaib1692@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> @Yukon Thank you for your reply. This clears some doubts.
> >>>>>>
> >>>>>> Sorry for the delay as I was somewhat occupied with another
> >> project.
> >>> I
> >>>>> have
> >>>>>> created an initial design doc. Email is a bit cumbersome for
> >>> feedback I
> >>>>>> wrote this document in two formats:
> >>>>>>
> >>>>>> 1. In the form of a Google document:
> >>>>>> https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP
> >>>>>> 1Q-M6rj3yZde24.
> >>>>>> The document is open for comments to all users without signing in.
> >> I
> >>>>> would
> >>>>>> appreciate it if you put your name before the comment so I can
> >>> identify
> >>>>> who
> >>>>>> to follow up the discussion with.
> >>>>>>
> >>>>>> 2. As a markdown on github:
> >>>>>> https://github.com/sohaibiftikhar/rocketmq/blob/
> >>>>> gsoc_design/gsoc_design.md
> >>>>>> .
> >>>>>> The comments for this can be made on the commit:
> >>>>>> https://github.com/sohaibiftikhar/rocketmq/commit/
> >>>>>> dfd55fc69f430fc024217a3b20dde31717334e62
> >>>>>>
> >>>>>> After I have received a certain amount of feedback I will try to
> >>>>>> incorporate it and put in a subsequent version for review. Please
> >>> tell
> >>>> me
> >>>>>> which methods suits you better (gdoc or github) for review and we
> >> can
> >>>>>> continue with that for the subsequent versions.
> >>>>>>
> >>>>>> Lastly, the document is a couple of pages so I appreciate your
> >>> patience
> >>>>> and
> >>>>>> your help.
> >>>>>> Looking forward to your opinions.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Sohaib
> >>>>>>
> >>>>>> On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
> >>>>>>
> >>>>>>> Hi Sohaib,
> >>>>>>>
> >>>>>>> Sorry for the late reply. We can move this project forward now ~
> >>>>>>>
> >>>>>>> ```
> >>>>>>> I would at some point like to post
> >>>>>>> design ideas to this problem privately to get it reviewed by the
> >>>>>>> development community but not make it publicly available so that it
> >>>>>>> cannot be plagiarised.
> >>>>>>> ```
> >>>>>>>
> >>>>>>> You can send your design ideas to me directly or to our PMC list
> >>>>>>> (private@rocketmq.apache.org) if you want to keep your ideas private.
> >>>>>>> But please don't break away from the community.
> >>>>>>>
> >>>>>>> I hope you have already understood the goal of this project. RocketMQ
> >>>>>>> currently supports at-least-once delivery, so an obvious way to
> >>>>>>> achieve exactly-once is to remove duplicated messages.
> >>>>>>>
> >>>>>>> Return to your original questions:
> >>>>>>>
> >>>>>>> 1. What defines a redundant message?
> >>>>>>>
> >>>>>>> A message ID is generated when a new message is created, so this ID
> >>>>>>> can be used to identify a message. Also, the user could specify a
> >>>>>>> unique business-related property to identify a message.
> >>>>>>>
> >>>>>>> Redundant messages occur when the network is broken or reconnected,
> >>>>>>> a rebalance[1] is triggered, etc.
> >>>>>>>
> >>>>>>>
> >>>>>>> 2. Is there a timeline on the redundant messages?
> >>>>>>>
> >>>>>>> Yes, keeping all messages non-redundant is expensive, so let's
> >>>>>>> consider this question within a certain time window ~
> >>>>>>>
> >>>>>>> Looking forward to your design.
> >>>>>>>
> >>>>>>> [1].
> >>>>>>> https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
> >>>>>>>
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> yukon
> >>>>>>>
> >>>>>>>
> >>>>>>> On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <sohaib1692@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> @Zhanhui Thanks for the response. This is not a campaign; it's just
> >>>>>>>> part of GSoC (https://summerofcode.withgoogle.com/). And community
> >>>>>>>> help is gladly welcomed. In fact, it is recommended :)
> >>>>>>>>
> >>>>>>>> @KaiYuan Thanks for your suggestions. I will come up with a flow
> >>>>>>>> chart for the proposed solution this weekend.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Sohaib
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhanhui@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Sohaib,
> >>>>>>>>>
> >>>>>>>>> I have been rather busy these days. Sorry for replying so late!
> >>>>>>>>>
> >>>>>>>>> Not sure what "deadline" you are referring to. If this is part of a
> >>>>>>>>> campaign, I have to admit I am not aware of the regulations and
> >>>>>>>>> what kind of help I should offer to maintain fairness, considering
> >>>>>>>>> that other similar issues may arise.
> >>>>>>>>>
> >>>>>>>>> Regards!
> >>>>>>>>>
> >>>>>>>>> Zhanhui Li
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi guys,
> >>>>>>>>>>
> >>>>>>>>>> Would be nice to have some feedback on this as the deadline is
> >>>>>>>>>> not too far :)
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Sohaib
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Sohaib Iftikhar
> >>>>>>>>>>
> >>>>>>>>>> -- Man is still the most extraordinary computer of all.--
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1692@gmail.com
> >>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>> Thank you for the pointers to the code. This was super helpful.
> >>>>>>>>>> The multiple keys can probably be serialized better than
> >>>>>>>>>> separating them with a space but that is already legacy I suppose.
> >>>>>>>>>>
> >>>>>>>>>> Firstly, filters like bloom or cuckoo are probabilistic (they can
> >>>>>>>>>> report false positives). They can help make things faster but
> >>>>>>>>>> definitely cannot be used as the only solution. Hence, in the end,
> >>>>>>>>>> we will still need a persistent keystore/distributed set. My plan
> >>>>>>>>>> was to have this keystore as distributed (raft guarantee etc.).
> >>>>>>>>>> The keystore can also hold a persistent filter on its end. If a
> >>>>>>>>>> broker collapses it can renew/refresh its filter from the
> >>>>>>>>>> keystore, hence eliminating the problems about crashes that you
> >>>>>>>>>> mention. The problem here could be in maintaining performance for
> >>>>>>>>>> filters in case of removals from the keystore (e.g. sliding
> >>>>>>>>>> windows as mentioned in my previous mail). Periodic refreshing of
> >>>>>>>>>> filters can help solve this but I am open to suggestions on how to
> >>>>>>>>>> make this better.
> >>>>>>>>>>
> >>>>>>>>>> I think implementing a distributed set on the client cluster has
> >>>>>>>>>> its caveats. The way I understand RocketMQ is that we do not have
> >>>>>>>>>> control over the disk space/memory on the client end, so we
> >>>>>>>>>> probably only have a constant amount. A distributed set on the
> >>>>>>>>>> client would also need to be persistent, e.g. if a client
> >>>>>>>>>> restarts/recovers. This basically means we need a keystore on the
> >>>>>>>>>> client instead of the broker cluster. This probably puts too much
> >>>>>>>>>> responsibility on the client cluster. A different approach would
> >>>>>>>>>> be to ensure that the offsets are always in sync with the broker.
> >>>>>>>>>> Since the broker only serves unique messages (based on the
> >>>>>>>>>> proposed solution on the producer/broker end), all we need to
> >>>>>>>>>> ensure is that a client does not consume messages with the same
> >>>>>>>>>> offset twice.
> >>>>>>>>>>
> >>>>>>>>>> Please suggest improvements if this does not look like the correct
> >>>>>>>>>> approach. Also would be great if someone can come up with a
> >>>>>>>>>> completely different approach so that we can weigh up pros and
> >>>>>>>>>> cons.
> >>>>>>>>>>
> >>>>>>>>>> Thanks for reading this through and looking forward to your
> >>>>>>>>>> opinions.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Sohaib
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Sohaib Iftikhar
> >>>>>>>>>>
> >>>>>>>>>> -- Man is still the most extraordinary computer of all.--
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhanhui@gmail.com
> >>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>> Hi Sohaib,
> >>>>>>>>>>
> >>>>>>>>>> About multiple key support, the following should clarify your
> >>>>>>>>>> doubt: the org.apache.rocketmq.common.message.Message class has
> >>>>>>>>>> overloaded setKeys methods, allowing you to set multiple keys via
> >>>>>>>>>> a string (separated by spaces; sorry, we have not yet unified all
> >>>>>>>>>> separators, and I hope this does not confuse you) or a collection.
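
The space-separated encoding described above, and the concern raised elsewhere in this thread about a key that itself contains the separator, can be illustrated with a small sketch. This is not RocketMQ code: KeySeparatorDemo, join and split are hypothetical helpers, and the only thing taken from the thread is that multiple keys are joined with a single space.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration of joining several keys into one space-separated string. */
class KeySeparatorDemo {
    // Assumption taken from the thread: keys are separated by a single space.
    static final String SEPARATOR = " ";

    /** Joins the keys into the single string that would be stored as one property. */
    static String join(List<String> keys) {
        return String.join(SEPARATOR, keys);
    }

    /** Splits the stored string back into individual keys. */
    static List<String> split(String joined) {
        return Arrays.asList(joined.split(SEPARATOR));
    }

    public static void main(String[] args) {
        // Keys without spaces survive the round trip unchanged.
        System.out.println(split(join(Arrays.asList("order-1", "user-7"))));
        // A key containing the separator comes back as two keys.
        System.out.println(split(join(Arrays.asList("order 1", "user-7"))));
    }
}
```

The second call returns three keys instead of two, which is the ambiguity a reserved or escaped separator would avoid.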
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> When the broker builds the index for a message with multiple
> >>>>>>>>>> keys, multiple index entries are inserted into the index file.
> >>>>>>>>>> See org.apache.rocketmq.store.index.IndexService#buildIndex
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> In terms of eliminating message duplication, personally, I wish
> >>>>>>>>>> we had a feature of exactly-once semantics covering the whole
> >>>>>>>>>> cluster and the complete send-store-consume process. A rough idea
> >>>>>>>>>> is to route the message according to its unique key to a broker
> >>>>>>>>>> according to a rule; the serving broker ensures uniqueness of the
> >>>>>>>>>> message according to the key (as you said, bloom filter, cuckoo
> >>>>>>>>>> filter, etc.). Things might look simple, but issues reside in
> >>>>>>>>>> scenarios where the cluster is experiencing membership changes:
> >>>>>>>>>> for example, what if a broker crashes? We might need to propagate
> >>>>>>>>>> the bloom-filter bitset synchronously to other brokers having the
> >>>>>>>>>> same topics. What if a new broker joins the cluster and starts to
> >>>>>>>>>> serve? I do not mean this is too complex to implement. Instead,
> >>>>>>>>>> this is a pretty interesting topic and a fancy feature to have.
> >>>>>>>>>> Alternatively, we might defer eliminating duplicates to the
> >>>>>>>>>> consumption phase using a kind of distributed set. For sure, my
> >>>>>>>>>> proposed idea suffers from the same challenges, including
> >>>>>>>>>> membership changes.
> >>>>>>>>>>
> >>>>>>>>>> Guys of dev board, any insights on this issue?
> >>>>>>>>>>
> >>>>>>>>>> Zhanhui Li
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1692@gmail.com
> >>>>>>>>>>> <mailto:sohaib1692@gmail.com>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Zhanhui,
> >>>>>>>>>>>
> >>>>>>>>>>> I have a doubt about these multiple keys. If I am wrong in any
> >>>>>>>>>>> of the assumptions I make, please point it out.
> >>>>>>>>>>>
> >>>>>>>>>>> If there is support for multiple keys, I cannot see it in the
> >>>>>>>>>>> code. The class Message only stores a single key in the property
> >>>>>>>>>>> map against the property name "KEYS". Is this also done in the
> >>>>>>>>>>> same way as tags, that is, different keys are separated with
> >>>>>>>>>>> ' || '? So basically, as a user of the producer API, it is the
> >>>>>>>>>>> user's responsibility to ensure that he separates the different
> >>>>>>>>>>> keys with the correct separator. I can see an obvious problem
> >>>>>>>>>>> here: what if the key contains the special sequence ' || '? But
> >>>>>>>>>>> maybe this event is rare and hence not important. Could you
> >>>>>>>>>>> point me to some source/doc that explains this part? I was
> >>>>>>>>>>> looking at the index section of rocketmq-store but I have not
> >>>>>>>>>>> been able to understand the indexing process completely for now.
> >>>>>>>>>>> I will keep reading the source to get a better idea.
> >>>>>>>>>>>
> >>>>>>>>>>> Moving on to the implementation details. Here is a broad idea of
> >>>>>>>>>>> one possible way to approach it.
> >>>>>>>>>>>
> >>>>>>>>>>> The attempt is to remove duplicate messages. In this issue, I
> >>>>>>>>>>> would like to aim at eliminating duplicate messages at the
> >>>>>>>>>>> producer/broker end. For now, we do not concern ourselves with
> >>>>>>>>>>> the duplicate messages caused by unwritten consumer offsets, as
> >>>>>>>>>>> these two issues have different solutions.
> >>>>>>>>>>> One way to solve this problem at the producer/broker end could
> >>>>>>>>>>> be to have a distributed key store that stores the messages. We
> >>>>>>>>>>> can make it configurable such that this distributed store stores
> >>>>>>>>>>> all messages or works as a sliding window keeping only the
> >>>>>>>>>>> messages from the last X seconds specified by the user. We can
> >>>>>>>>>>> have a layer on top to check set membership, such as a bloom
> >>>>>>>>>>> filter or a cuckoo filter
> >>>>>>>>>>> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf), to
> >>>>>>>>>>> help performance. Every message pushed in by a producer is
> >>>>>>>>>>> checked first with the filter and, in case of a positive result,
> >>>>>>>>>>> with this key store. If the message is found then it is
> >>>>>>>>>>> discarded. This helps remove duplicates completely from a
> >>>>>>>>>>> producer perspective. The core of this idea is the distributed
> >>>>>>>>>>> key store, which would be completely separate from the current
> >>>>>>>>>>> message storage. Since the concept of a distributed key store or
> >>>>>>>>>>> a key/value store is not novel, there are two ways to do this.
> >>>>>>>>>>> 1. Implement it ourselves. This would be high effort but no
> >>>>>>>>>>> external dependencies.
> >>>>>>>>>>> 2. Use a key-value store such as Redis (which already has
> >>>>>>>>>>> timeouts and persistence but a large memory footprint) or some
> >>>>>>>>>>> other disk-based storage for set membership. This would include
> >>>>>>>>>>> an external dependency but development time will reduce
> >>>>>>>>>>> significantly for such a solution.
> >>>>>>>>>>> I am inclined towards implementing it myself as this would avoid
> >>>>>>>>>>> dependencies on other products, especially since RocketMQ is
> >>>>>>>>>>> currently a self-reliant system. In addition, my past experience
> >>>>>>>>>>> with building such a store should also come in handy.
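
The filter-then-store check described above could be sketched roughly as follows. This is an illustration only, not the proposed implementation: FilteredKeyStore is a hypothetical name, an in-memory HashSet stands in for the distributed key store, and the filter size and hashing scheme are arbitrary choices made for the sketch.

```java
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: a small Bloom filter in front of an authoritative key store. */
class FilteredKeyStore {
    private static final int BITS = 1 << 16; // filter size, arbitrary for this sketch
    private static final int HASHES = 3;     // probes per key, arbitrary for this sketch

    private final BitSet filter = new BitSet(BITS);
    // Stand-in for the distributed key store; a real one would be remote and persistent.
    private final Set<String> store = new HashSet<>();
    // Counts how often the (expensive) store had to be consulted.
    int storeLookups = 0;

    /** Derives the i-th probe position from two base hashes (double hashing). */
    private int probe(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1; // odd step so the probes are spread out
        return Math.floorMod(h1 + i * h2, BITS);
    }

    /** False means the key was definitely never added; true means it may have been. */
    private boolean mightContain(String key) {
        for (int i = 0; i < HASHES; i++) {
            if (!filter.get(probe(key, i))) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if the key is new (and records it), false if it is a duplicate. */
    boolean offer(String key) {
        if (mightContain(key)) {
            // Positive filter result: could be a false positive, so ask the store.
            storeLookups++;
            if (store.contains(key)) {
                return false;
            }
        }
        for (int i = 0; i < HASHES; i++) {
            filter.set(probe(key, i));
        }
        store.add(key);
        return true;
    }

    public static void main(String[] args) {
        FilteredKeyStore dedup = new FilteredKeyStore();
        System.out.println(dedup.offer("msg-1")); // new key, filter answers alone
        System.out.println(dedup.offer("msg-1")); // duplicate, confirmed by the store
    }
}
```

A negative filter answer is always correct, so new keys skip the store lookup entirely; only a positive answer needs the authoritative check. A plain Bloom filter cannot forget keys, which is why removals (for example with a sliding window) would need a periodic rebuild or a filter that supports deletion.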
> >>>>>>>>>>>
> >>>>>>>>>>> I would like to know the opinions of the development community
> >>>>>>>>>>> on this approach and to hear suggested improvements. Looking
> >>>>>>>>>>> forward to your responses to this.
> >>>>>>>>>>>
> >>>>>>>>>>> ====<question unrelated to issue>=====
> >>>>>>>>>>> To increase my familiarity with the code base and to help prove
> >>>>>>>>>>> that I am familiar with the tools and technologies in place, it
> >>>>>>>>>>> would be great if I could be pointed to some low-effort issues
> >>>>>>>>>>> that I could help out with. In case there are no 'newbie' issues
> >>>>>>>>>>> available, I could help improve the comments inside the
> >>>>>>>>>>> codebase. I noticed some source files with no explanations,
> >>>>>>>>>>> which can be documented via comments to help onboard a new
> >>>>>>>>>>> contributor faster.
> >>>>>>>>>>> ====</question unrelated to issue>=====
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks a lot for reading this through and looking forward to
> >>>>>>>>>>> your opinions.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Sohaib
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhanhui@gmail.com
> >>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Sohaib,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Happy to know you are interested in RocketMQ.
> >>>>>>>>>>>>
> >>>>>>>>>>>> First, let me answer questions you raised.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - can there be multiple tags?
> >>>>>>>>>>>> No. At present, the storage engine allows a single tag only.
> >>>>>>>>>>>> Subscriptions are allowed to use a combination of tags. The
> >>>>>>>>>>>> current model should meet your business needs. If not, please
> >>>>>>>>>>>> let us know.
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> - key (Similar question to above.)
> >>>>>>>>>>>> RocketMQ builds its index using message keys. A single message
> >>>>>>>>>>>> may have multiple keys.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - About redundant message
> >>>>>>>>>>>> From my understanding, you are trying to eliminate duplicate
> >>>>>>>>>>>> messages. True, there are various reasons which may cause
> >>>>>>>>>>>> message duplication, ranging from message delivery to
> >>>>>>>>>>>> consumption. Discussion on this topic is warmly welcome. If you
> >>>>>>>>>>>> have any ideas to contribute on this issue, the developer board
> >>>>>>>>>>>> is happy to discuss them.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Zhanhui Li
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com
> >>>>>>>>>>>>> <mailto:sohaib1692@gmail.com>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> My earlier email message seems to have gotten lost, so I will
> >>>>>>>>>>>>> try again.
> >>>>>>>>>>>>> Please see the original message for the discussion.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>> Sohaib Iftikhar
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -- Man is still the most extraordinary computer of
> >> all.--
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <sohaib1692@gmail.com
> >>>>>>>>>>>>> <ma...@gmail.com>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I am interested in working on this issue
> >>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part
> >>>>>>>>>>>>>> of GSOC-18. I have a few questions about it. I am not sure if
> >>>>>>>>>>>>>> this discussion needs to be on the JIRA issue or here. Feel
> >>>>>>>>>>>>>> free to correct me if this is the wrong platform. Also, while
> >>>>>>>>>>>>>> I have worked with distributed pub-sub systems, I am still
> >>>>>>>>>>>>>> fairly new to RocketMQ, so maybe my understanding of it is
> >>>>>>>>>>>>>> incorrect. I apologise if that is the case and would be happy
> >>>>>>>>>>>>>> to stand corrected.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Following are my questions:
> >>>>>>>>>>>>>> 1. What defines a redundant message?
> >>>>>>>>>>>>>>  The constructor that I see for a message is as follows:
> >>>>>>>>>>>>>>  Message(String topic, String tags, String keys, int flag,
> >>>>>>>>>>>>>>  byte[] body, boolean waitStoreMsgOK)
> >>>>>>>>>>>>>>  Possible candidates to me are topic, tags (can there be
> >>>>>>>>>>>>>>  multiple tags? I could not find an example for this. If yes,
> >>>>>>>>>>>>>>  how are they separated?), keys (similar question to above)
> >>>>>>>>>>>>>>  and of course the body. Is there something that I have
> >>>>>>>>>>>>>>  missed in this? Is there something that we do not need to
> >>>>>>>>>>>>>>  consider?
> >>>>>>>>>>>>>> 2. Is there a timeline on the redundant messages? What I mean
> >>>>>>>>>>>>>> by this is: is there a time limit after which a message with
> >>>>>>>>>>>>>> similar content is allowed? From what I gather, no such thing
> >>>>>>>>>>>>>> was mentioned. This would mean storing all the messages.
> >>>>>>>>>>>>>> Depending on the requirements this may or may not be the best
> >>>>>>>>>>>>>> solution. It might be desirable that no duplicates are
> >>>>>>>>>>>>>> allowed within a certain (sliding) time window. This allows
> >>>>>>>>>>>>>> ignoring duplicate messages that were generated very close to
> >>>>>>>>>>>>>> each other (or within the window indicated). Depending on
> >>>>>>>>>>>>>> this requirement, the implementation may become a little bit
> >>>>>>>>>>>>>> more involved.
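
One way to picture the sliding-window variant raised in question 2 is to remember each key only for a fixed interval. The sketch below is an assumption-laden illustration, not a RocketMQ design: SlidingWindowDeduplicator and firstSeen are hypothetical names, everything is held in memory, and the caller passes the current time explicitly (assumed non-decreasing) so that the behaviour is deterministic.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch: treat a key as a duplicate only within a fixed time window. */
class SlidingWindowDeduplicator {
    /** A key together with the time it was first accepted. */
    private static final class Seen {
        final String key;
        final long atMillis;

        Seen(String key, long atMillis) {
            this.key = key;
            this.atMillis = atMillis;
        }
    }

    private final long windowMillis;
    private final Deque<Seen> byAge = new ArrayDeque<>(); // oldest entry first
    private final Set<String> live = new HashSet<>();     // keys currently inside the window

    SlidingWindowDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true if the key has not been accepted within the last windowMillis,
     * false if it is a duplicate inside the window.
     */
    boolean firstSeen(String key, long nowMillis) {
        // Drop every entry that has aged out of the window.
        while (!byAge.isEmpty() && nowMillis - byAge.peekFirst().atMillis >= windowMillis) {
            live.remove(byAge.pollFirst().key);
        }
        if (!live.add(key)) {
            return false; // still remembered, so this is a duplicate
        }
        byAge.addLast(new Seen(key, nowMillis));
        return true;
    }

    public static void main(String[] args) {
        SlidingWindowDeduplicator dedup = new SlidingWindowDeduplicator(1000);
        System.out.println(dedup.firstSeen("k", 0));    // accepted
        System.out.println(dedup.firstSeen("k", 500));  // duplicate inside the window
        System.out.println(dedup.firstSeen("k", 1000)); // window elapsed, accepted again
    }
}
```

Memory is bounded by the number of distinct keys seen during one window rather than by the full message history, which is the trade-off the question is about.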
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> For now, these are the only questions. I have ideas about
> >>>>>>>>>>>>>> possible implementations that need review, but I will mention
> >>>>>>>>>>>>>> them once the specifications are clear to me. As an end
> >>>>>>>>>>>>>> question, I would at some point like to post design ideas to
> >>>>>>>>>>>>>> this problem privately to get it reviewed by the development
> >>>>>>>>>>>>>> community but not make it publicly available so that it
> >>>>>>>>>>>>>> cannot be plagiarised. What platform/method can I use to do
> >>>>>>>>>>>>>> that? Or is submitting a draft to the Google platform the
> >>>>>>>>>>>>>> only possible way to accomplish this?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks a lot for reading this through and looking forward to
> >>>>>>>>>>>>>> your inputs.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Regards,
> >>>>>>>>>>>>>> Sohaib Iftikhar
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>

Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Posted by Von Gosling <vo...@apache.org>.
Hi Sohaib,

I have reviewed your document and made some suggestions on the problems you raised.

To all other GSoC students: could you follow the same practice as Sohaib? We look forward to your proposals on Google Docs.

Best Regards,
Von Gosling

> On Mar 9, 2018, at 15:17, Sohaib Iftikhar <so...@gmail.com> wrote:
> 
> Hi Yukon,
> 
> What do you suggest for the key store itself? Do you propose writing this
> ourselves or using some existing solution and writing a layer on top?
> 
> Thanks,
> Sohaib
> 
> On Fri, Mar 9, 2018 at 6:20 AM, yukon <yu...@apache.org> wrote:
> 
>> ```
>> Personally, I find RAFT to be much simpler to implement. However, I do not
>> expect to reinvent the wheel here.
>> ```
>> 
>> That's absolutely right, no need to reinvent the wheel, there are many
>> existing implementations for raft: https://raft.github.io/
>> 
>> ```
>> I don't think using a key store to persist all the messages is a good idea.
>> ```
>> 
>> Yes, storing an ID is enough.
>> 
>> 
>> On Thu, Mar 8, 2018 at 3:32 PM, Sohaib Iftikhar <so...@gmail.com>
>> wrote:
>> 
>>> Hi Dexin,
>>> 
>>> Thank you for your suggestions. I will try to answer as much as I can and
>>> leave the rest to the RocketMQ team.
>>> 
>>> 1. The idea with incremental IDs is actually quite good. But @Yukon
>>> mentioned that duplication can also be controlled by the application
>>> (a special KV property), in which case different producers may produce
>>> the same message, which then needs to be deduplicated on the broker.
>>> SessionId+IncrementalId won't work in this scenario, I believe. But we
>>> can switch to more efficient storage using the idea you described when
>>> the user is not specifying these special keys.
>>> Also, I proposed storing keys for only a fixed time interval. For all
>>> practical purposes this would still remain constant time. [Log base 2 of
>>> 10^10 is still just 33 :) ]. It does add the extra cost of communication,
>>> but this would be the case in both scenarios.
>>> 2. As for consensus, the ideas I presented were pretty abstract, so I
>>> mentioned a couple of algorithms that could potentially be used.
>>> Personally, I find RAFT to be much simpler to implement. However, I do
>>> not expect to reinvent the wheel here. I strongly believe that in this
>>> case we can build upon some tested existing solution.
>>> 
>>> 
>>> Regards,
>>> Sohaib
>>> 
>>> On Thu, Mar 8, 2018 at 1:31 AM, 李 德鑫 <de...@outlook.com> wrote:
>>> 
>>>> Hi Sohaib,
>>>> 
>>>> 
>>>> I'm a student applying for GSoC too, and I've read all of your
>>>> discussion on the mailing list.
>>>> 
>>>> I have some questions about your design, and some of them may need to
>>>> be answered by the RocketMQ team, so I am sending them here for
>>>> discussion.
>>>> 
>>>> I don't think using a key store to persist all the messages is a good
>>>> idea. An MQ is built on O(1) data structures, so the key store would
>>>> harm performance.
>>>> 
>>>> I think we can learn from TCP protocol.
>>>> 
>>>> In producer-broker communication, we can assign an incremental ID to
>>>> every message sent in the same session, and the producer should persist
>>>> its session ID on disk. The broker then only needs to maintain a map
>>>> from session ID to the expected message ID (this is how Kafka does it),
>>>> which stays small because there are far fewer producers than messages.
>>>> However, a K/V store is still needed, so we have to ask the RocketMQ
>>>> team how many producers are active at the same time in practice.
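
A minimal sketch of the session-ID scheme described above, assuming the broker keeps one expected sequence number per producer session. The names SessionSequenceDeduplicator and accept are hypothetical and not part of RocketMQ; persistence of the map, session expiry, and the application-supplied business key case are all left out.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical broker-side check: one expected sequence number per producer session. */
class SessionSequenceDeduplicator {
    // sessionId -> next sequence number the broker expects from that session
    private final Map<String, Long> expectedSeq = new HashMap<>();

    /**
     * Returns true if the message is new and should be stored,
     * false if it is a retransmission of a message that was already accepted.
     */
    synchronized boolean accept(String sessionId, long seq) {
        long expected = expectedSeq.getOrDefault(sessionId, 0L);
        if (seq < expected) {
            return false; // already accepted earlier, so this is a duplicate
        }
        if (seq > expected) {
            // A gap means an earlier message is missing; this sketch simply refuses it.
            throw new IllegalStateException(
                "out of order: expected " + expected + " but got " + seq);
        }
        expectedSeq.put(sessionId, expected + 1);
        return true;
    }

    public static void main(String[] args) {
        SessionSequenceDeduplicator broker = new SessionSequenceDeduplicator();
        System.out.println(broker.accept("producer-A", 0)); // accepted
        System.out.println(broker.accept("producer-A", 1)); // accepted
        System.out.println(broker.accept("producer-A", 1)); // retry of 1, rejected
        System.out.println(broker.accept("producer-B", 0)); // independent session, accepted
    }
}
```

Only one number is kept per session, so the state grows with the number of producers rather than the number of messages. The sketch does not detect two different producers sending the same business key, which is the limitation raised in the reply to this proposal.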
>>>> 
>>>> The same idea applies to consumer-broker communication.
>>>> 
>>>> 
>>>> About the consensus algorithm, I think RocketMQ should already have an
>>>> implementation. I don't know what it is, but maybe you can reuse it.
>>>> And if you do have to implement one, in my opinion there's no need to
>>>> implement both Paxos and Raft, since they solve the same kind of
>>>> problem.
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> 
>>>> Dexin
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Sohaib Iftikhar <so...@gmail.com>
>>>> Sent: March 7, 2018 18:15:51
>>>> To: dev@rocketmq.apache.org
>>>> Subject: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery
>>>> mechanism
>>>> 
>>>> Hi Yukon,
>>>> 
>>>> Thanks for your reply. Yes, it would be nice to concretely define the
>>>> scope of this project as the doc is a bit ambitious for just a summer.
>>>> Should you (or anyone else) have questions/suggestions/clarifications,
>>>> I'd be glad to discuss more details.
>>>> 
>>>> Thanks,
>>>> Sohaib
>>>> 
>>>> On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> Google doc is better for discussion. Your design is great; now we
>>>>> could discuss more details based on it.
>>>>> 
>>>>> Any advice from the RocketMQ community is welcome.
>>>>> 
>>>>> Appreciate your efforts.
>>>>> 
>>>>> Regards,
>>>>> yukon
>>>>> 
>>>>> On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <sohaib1692@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> @Yukon Thank you for your reply. This clears some doubts.
>>>>>> 
>>>>>> Sorry for the delay as I was somewhat occupied with another project. I
>>>>>> have created an initial design doc. Since email is a bit cumbersome for
>>>>>> feedback, I wrote this document in two formats:
>>>>>> 
>>>>>> 1. In the form of a Google document:
>>>>>> https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24
>>>>>> .
>>>>>> The document is open for comments to all users without signing in. I
>>>>>> would appreciate it if you put your name before the comment so I can
>>>>>> identify who to follow up the discussion with.
>>>>>> 
>>>>>> 2. As a Markdown file on GitHub:
>>>>>> https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md
>>>>>> .
>>>>>> The comments for this can be made on the commit:
>>>>>> https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
>>>>>> 
>>>>>> After I have received a certain amount of feedback I will try to
>>>>>> incorporate it and put in a subsequent version for review. Please tell
>>>>>> me which method suits you better (gdoc or github) for review and we can
>>>>>> continue with that for the subsequent versions.
>>>>>> 
>>>>>> Lastly, the document is a couple of pages so I appreciate your patience
>>>>>> and your help.
>>>>>> Looking forward to your opinions.
>>>>>> 
>>>>>> Thanks,
>>>>>> Sohaib
>>>>>> 
>>>>>> On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
>>>>>> 
>>>>>>> Hi Sohaib,
>>>>>>> 
>>>>>>> Sorry for the late reply. We can move this project forward now ~
>>>>>>> 
>>>>>>> ```
>>>>>>> I would at some point like to post
>>>>>>> design ideas to this problem privately to get it reviewed by the
>>>>>>> development community but not make it publicly available so that it
>>>>>>> cannot be plagiarised.
>>>>>>> ```
>>>>>>> 
>>>>>>> You can send your design ideas to me directly or to our PMC list
>>>>>>> (private@rocketmq.apache.org) if you want to keep your ideas private.
>>>>>>> But please don't break away from the community.
>>>>>>> 
>>>>>>> I hope you have already understood the goal of this project. RocketMQ
>>>>>>> currently supports at-least-once delivery, so an obvious way to achieve
>>>>>>> exactly-once is to remove duplicated messages.
>>>>>>> 
>>>>>>> Return to your original questions:
>>>>>>> 
>>>>>>> 1. What defines a redundant message?
>>>>>>> 
>>>>>>> A message ID is generated when a new message is created, so this ID can
>>>>>>> be used to identify a message. Also, the user could specify a unique
>>>>>>> business-related property to identify a message.
>>>>>>> 
>>>>>>> Redundant messages occur when the network is broken or reconnected,
>>>>>>> a rebalance[1] is triggered, etc.
>>>>>>> 
>>>>>>> 
>>>>>>> 2. Is there a timeline on the redundant messages?
>>>>>>> 
>>>>>>> Yes, keeping all messages non-redundant is expensive, so let's consider
>>>>>>> this question within a certain time window ~
>>>>>>> 
>>>>>>> Looking forward to your design.
>>>>>>> 
>>>>>>> [1].
>>>>>>> https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> yukon
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <sohaib1692@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> @Zhanhui Thanks for the response. This is not a campaign; it's just
>>>>>>>> part of GSoC (https://summerofcode.withgoogle.com/). And community help
>>>>>>>> is gladly welcomed. In fact, it is recommended :)
>>>>>>>> 
>>>>>>>> @KaiYuan Thanks for your suggestions. I will come up with a flow chart
>>>>>>>> for the proposed solution this weekend.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Sohaib
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Sohaib,
>>>>>>>>> 
>>>>>>>>> I have been rather busy these days. Sorry for replying so late!
>>>>>>>>> 
>>>>>>>>> Not sure what "deadline" you are referring to. If this is part of a
>>>>>>>>> campaign, I have to admit I am not aware of the regulations and what
>>>>>>>>> kind of help I should offer to maintain fairness, considering that
>>>>>>>>> other similar issues may arise.
>>>>>>>>> 
>>>>>>>>> Regards!
>>>>>>>>> 
>>>>>>>>> Zhanhui Li
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi guys,
>>>>>>>>>> 
>>>>>>>>>> Would be nice to have some feedback on this as the deadline is not
>>>>>>>>>> too far :)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Sohaib
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>> 
>>>>>>>>>> -- Man is still the most extraordinary computer of all.--
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <
>>>>>>>> sohaib1692@gmail.com
>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>> Thank you for the pointers to the code. This was super
>>> helpful.
>>>>> The
>>>>>>>>> multiple keys can probably be serialized better than
>> separating
>>>>> them
>>>>>>>> with a
>>>>>>>>> space but that is already legacy I suppose.
>>>>>>>>>> 
>>>>>>>>>> Firstly filters like bloom or cuckoo are heuristic. They
>> can
>>>> help
>>>>>>> make
>>>>>>>>> things faster but definitely cannot be used as the only
>>> solution.
>>>>>>> Hence,
>>>>>>>> in
>>>>>>>>> the end, we will still need a persistent keystore/distributed
>>>> set.
>>>>> My
>>>>>>>> plan
>>>>>>>>> was to have this keystore as distributed (raft guarantee
>> etc.).
>>>> The
>>>>>>>>> keystore can also hold a persistent filter on its end. If a
>>>> broker
>>>>>>>>> collapses it can renew/refresh its filter from the keystore.
>>>> Hence
>>>>>>>>> eliminating the problems about crashes that you mention. The
>>>>> problem
>>>>>>> here
>>>>>>>>> could be in maintaining performance for filters in case of
>>>> removals
>>>>>>> from
>>>>>>>>> the keystore (for eg: sliding windows as mentioned in my
>>> previous
>>>>>>> mail).
>>>>>>>>> Periodic refreshal of filters can help solve this but I am
>> open
>>>> to
>>>>>>>>> suggestions on how to make this better.
>>>>>>>>>> 
>>>>>>>>>> I think implementing a distributed set on the client cluster has its
>>>>>>>>>> caveats. The way I understand RocketMQ is that we do not have control
>>>>>>>>>> over the disk space/memory on the client end, so we probably only have
>>>>>>>>>> a constant amount. A distributed set on the client would also need to
>>>>>>>>>> be persistent, e.g. if a client restarts/recovers. This basically
>>>>>>>>>> means we need a keystore on the client instead of the broker cluster.
>>>>>>>>>> This probably puts too much responsibility on the client cluster. A
>>>>>>>>>> different approach would be to ensure that the offsets are always in
>>>>>>>>>> sync with the broker. Since the broker only serves unique messages
>>>>>>>>>> (based on the proposed solution on the producer/broker end) all we
>>>>>>>>>> need to ensure is that a client does not consume messages with the
>>>>>>>>>> same offset twice.
>>>>>>>>>> 
>>>>>>>>>> Please suggest improvements if this does not look like the correct
>>>>>>>>>> approach. It would also be great if someone can come up with a
>>>>>>>>>> completely different approach so that we can weigh up pros and cons.
>>>>>>>>>> 
>>>>>>>>>> Thanks for reading this through and looking forward to your opinions.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Sohaib
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>> 
>>>>>>>>>> -- Man is still the most extraordinary computer of all.--
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
>>>>>>>>>> Hi Sohaib,
>>>>>>>>>> 
>>>>>>>>>> About multiple key support, the following code pointers should clarify
>>>>>>>>>> your doubt:
>>>>>>>>>> The org.apache.rocketmq.common.message.Message class has overloaded
>>>>>>>>>> setKeys methods, allowing you to set multiple keys via a string
>>>>>>>>>> (separated by spaces; sorry, we have not yet unified all separators,
>>>>>>>>>> hoping this does not confuse you) or a collection.
>>>>>>>>>> 
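To make the separator concern concrete, here is a small self-contained Java sketch. It does not call RocketMQ; `SpaceSeparatedKeys`, `join`, and `split` are hypothetical helpers that only assume what the mail states, namely that multiple keys are stored as one space-separated string.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper illustrating space-separated key storage.
final class SpaceSeparatedKeys {
    // Assumption taken from the discussion: keys are joined with one space.
    static final String SEPARATOR = " ";

    // Joins several keys into the single string that would be stored.
    static String join(List<String> keys) {
        return String.join(SEPARATOR, keys);
    }

    // Splits the stored string back into keys, as an indexer would.
    static List<String> split(String joined) {
        return Arrays.asList(joined.split(SEPARATOR));
    }
}
```

With this scheme a key that itself contains a space does not round-trip: joining `"order 1"` and `"user-7"` and splitting again yields three keys instead of two, so callers would need to keep spaces out of their keys.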
>>>>>>>>>> 
>>>>>>>>>> When the broker builds the index for a message with multiple keys,
>>>>>>>>>> multiple index entries are inserted into the index file.
>>>>>>>>>> See org.apache.rocketmq.store.index.IndexService#buildIndex
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> In terms of eliminating message duplication, personally, I wish we had
>>>>>>>>>> an exactly-once semantic covering the whole cluster and the complete
>>>>>>>>>> send-store-consume process. A rough idea is to route each message to a
>>>>>>>>>> broker by applying a rule to its unique key; the serving broker then
>>>>>>>>>> ensures uniqueness of the message according to the key (as you said,
>>>>>>>>>> bloom-filter/cuckoo-filter, etc). Things might look simple, but issues
>>>>>>>>>> reside in scenarios where the cluster is experiencing membership
>>>>>>>>>> changes. For example, what if a broker crashes? We might need to
>>>>>>>>>> propagate the bloom-filter bitset synchronously to other brokers
>>>>>>>>>> having the same topics. What if a new broker joins the cluster and
>>>>>>>>>> starts to serve? I do not mean this is too complex to implement.
>>>>>>>>>> Instead, this is a pretty interesting topic and fancy feature to have.
>>>>>>>>>> Alternatively, we might defer eliminating duplicates to the
>>>>>>>>>> consumption phase using a kind of distributed set. For sure, this
>>>>>>>>>> alternative suffers from the same challenges, including membership
>>>>>>>>>> changes.
>>>>>>>>>> 
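The routing idea above can be illustrated with a short Java sketch. The mail only says "according to a rule"; hashing the key modulo the number of brokers is an assumption chosen for simplicity, and `KeyRouter` and the broker names used with it are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: route a message to a broker based on its unique key,
// so that all duplicates of one key reach the same broker.
final class KeyRouter {
    // Assumed rule: hash of the key modulo the current broker count.
    static String route(String uniqueKey, List<String> brokers) {
        int index = Math.floorMod(uniqueKey.hashCode(), brokers.size());
        return brokers.get(index);
    }
}
```

The membership-change problem is visible directly in this rule: the result depends on `brokers.size()`, so when a broker joins or crashes, a key may start routing to a broker that has never seen it. That is why the filter state would have to be propagated, or a scheme such as consistent hashing considered instead.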
>>>>>>>>>> Guys of dev board, any insights on this issue?
>>>>>>>>>> 
>>>>>>>>>> Zhanhui Li
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Zhanhui,
>>>>>>>>>>> 
>>>>>>>>>>> I have a doubt about these multiple keys. If I am wrong in any of the
>>>>>>>>>>> assumptions I make please point it out.
>>>>>>>>>>> 
>>>>>>>>>>> If there is support for multiple keys I cannot see this in the code.
>>>>>>>>>>> The class Message only stores a single key in the property map against
>>>>>>>>>>> the property name "KEYS". Is this also done in the same way as tags?
>>>>>>>>>>> That is, different keys are separated with ' || '? So basically as a
>>>>>>>>>>> user of the producer API it is the user's responsibility to ensure
>>>>>>>>>>> that he separates the different keys with the correct separator. I can
>>>>>>>>>>> see an obvious problem here. What if the key contains this special
>>>>>>>>>>> character ' || '? But maybe this event is rare and hence this is not
>>>>>>>>>>> important. Could you point me to some source/doc that explains this
>>>>>>>>>>> part? I was looking at the index section of rocketmq-store but I have
>>>>>>>>>>> not been able to understand the indexing process completely for now. I
>>>>>>>>>>> will keep reading the source to get a better idea.
>>>>>>>>>>> 
>>>>>>>>>>> Moving on to the implementation details. Here is a broad idea of one
>>>>>>>>>>> possible way to approach it.
>>>>>>>>>>> 
>>>>>>>>>>> The attempt is to remove duplicate messages. In this issue, I would
>>>>>>>>>>> like to aim at eliminating duplicate messages at the producer/broker
>>>>>>>>>>> end. For now, we do not concern ourselves with the duplicate messages
>>>>>>>>>>> happening due to unwritten consumer offsets as these two issues have
>>>>>>>>>>> different solutions. One way to solve this problem at the
>>>>>>>>>>> producer/broker end could be to have a distributed key store that
>>>>>>>>>>> stores the messages. We can make it configurable such that this
>>>>>>>>>>> distributed store stores all messages or works as a sliding window
>>>>>>>>>>> keeping only the messages from the last X seconds specified by the
>>>>>>>>>>> user. We can have a layer on top to check set membership such as a
>>>>>>>>>>> bloom filter or a cuckoo filter
>>>>>>>>>>> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) to help
>>>>>>>>>>> performance. Every message pushed in by a producer is checked first
>>>>>>>>>>> with the filter and, in case of a positive result, with this key
>>>>>>>>>>> store. If the message is found then it is discarded. This helps remove
>>>>>>>>>>> duplicates completely from a producer perspective. The core of this
>>>>>>>>>>> idea is the distributed key store which would be completely separate
>>>>>>>>>>> from the current message storage. Since the concept of a distributed
>>>>>>>>>>> key store or a key/value store is not novel there are two ways to do
>>>>>>>>>>> this.
>>>>>>>>>>> 1. Implement it ourselves. This would be high effort but no external
>>>>>>>>>>> dependencies.
>>>>>>>>>>> 2. Use a key-value store such as Redis (which already has timeouts and
>>>>>>>>>>> persistence but a large memory footprint) or some other disk-based
>>>>>>>>>>> storage for set membership. This would include an external dependency
>>>>>>>>>>> but development time will reduce significantly for such a solution.
>>>>>>>>>>> I am inclined towards implementing it by myself as this would avoid
>>>>>>>>>>> dependencies on other products especially since RocketMQ is currently
>>>>>>>>>>> a self-reliant system. In addition, my past experience with building
>>>>>>>>>>> such a store should also come in handy.
>>>>>>>>>>> 
>>>>>>>>>>> I would like to know the opinions of the development community on this
>>>>>>>>>>> approach and would welcome suggested improvements. Looking forward to
>>>>>>>>>>> your responses to this.
>>>>>>>>>>> 
>>>>>>>>>>> ====<question unrelated to issue>=====
>>>>>>>>>>> To increase my familiarity with the code base and to help prove that I
>>>>>>>>>>> am familiar with the tools and technologies in place it would be great
>>>>>>>>>>> if I could be pointed to some low effort issues that I could help out
>>>>>>>>>>> with. In case there are no 'newbie' issues available I could help
>>>>>>>>>>> improve the comments inside the codebase. I noticed some source files
>>>>>>>>>>> with no explanations which can be documented via comments to help
>>>>>>>>>>> onboard a new contributor faster.
>>>>>>>>>>> ====</question unrelated to issue>=====
>>>>>>>>>>> 
>>>>>>>>>>> Thanks a lot for reading this through and looking forward to your
>>>>>>>>>>> opinions.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Sohaib
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Sohaib,
>>>>>>>>>>>> 
>>>>>>>>>>>> Happy to know you are interested in RocketMQ.
>>>>>>>>>>>> 
>>>>>>>>>>>> First, let me answer questions you raised.
>>>>>>>>>>>> 
>>>>>>>>>>>> — can there be multiple tags?
>>>>>>>>>>>> No. At present, the storage engine allows a single tag only.
>>>>>>>>>>>> Subscriptions are allowed to use a combination of tags. The current
>>>>>>>>>>>> model should meet your business needs. If not, please let us know.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> — key (Similar question to above.)
>>>>>>>>>>>> RocketMQ builds its index using message keys. A single message may
>>>>>>>>>>>> have multiple keys.
>>>>>>>>>>>> 
>>>>>>>>>>>> — About redundant message
>>>>>>>>>>>> From my understanding, you are trying to eliminate duplicate
>>>>>>>>>>>> messages. True, there are various causes of message duplication,
>>>>>>>>>>>> ranging from message delivery to consumption. Discussion on this
>>>>>>>>>>>> topic is warmly welcome. If you have any ideas to contribute on this
>>>>>>>>>>>> issue, the developer board is happy to discuss.
>>>>>>>>>>>> 
>>>>>>>>>>>> Zhanhui Li
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My earlier email message seems to have gotten lost. So I will try
>>>>>>>>>>>>> again. Please see the original message for the discussion.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- Man is still the most extraordinary computer of all.--
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <sohaib1692@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am interested in working on this issue
>>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part of
>>>>>>>>>>>>>> GSoC 2018. I have a few questions about it. I am not sure if this
>>>>>>>>>>>>>> discussion needs to be on the JIRA issue or here. Feel free to
>>>>>>>>>>>>>> correct me if this is the wrong platform. Also, while I have worked
>>>>>>>>>>>>>> with distributed pub-sub systems I am still fairly new to RocketMQ
>>>>>>>>>>>>>> so maybe my understanding of it is incorrect. I apologise if that
>>>>>>>>>>>>>> is the case and would be happy to stand corrected.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Following are my questions:
>>>>>>>>>>>>>> 1. What defines a redundant message?
>>>>>>>>>>>>>>  The constructor that I see for a message is as follows:
>>>>>>>>>>>>>>  Message(String topic, String tags, String keys, int flag,
>>>>>>>>>>>>>>  byte[] body, boolean waitStoreMsgOK)
>>>>>>>>>>>>>>  Possible candidates to me are topic, tags (can there be multiple
>>>>>>>>>>>>>>  tags? I could not find an example for this. If yes how are they
>>>>>>>>>>>>>>  separated?), keys (Similar question to above.) and of course the
>>>>>>>>>>>>>>  body. Is there something that I have missed in this? Is there
>>>>>>>>>>>>>>  something that we do not need to consider?
>>>>>>>>>>>>>> 2. Is there a timeline on the redundant messages? What I mean by
>>>>>>>>>>>>>> this is: is there a time limit after which a message with similar
>>>>>>>>>>>>>> content is allowed? From what I gather there was no such thing
>>>>>>>>>>>>>> mentioned. This would mean storing all the messages. Depending on
>>>>>>>>>>>>>> the requirements this may or may not be the best solution. It might
>>>>>>>>>>>>>> be desirable that no duplicates are allowed within a certain
>>>>>>>>>>>>>> (sliding) time window. This allows ignoring duplicate messages that
>>>>>>>>>>>>>> were generated very close to each other (or in the window
>>>>>>>>>>>>>> indicated). Depending on this requirement the implementation may
>>>>>>>>>>>>>> become a little bit more involved.
>>>>>>>>>>>>>> 
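The sliding-window requirement in question 2 can be sketched in Java as below. This is only an illustration under stated assumptions: `SlidingWindowDedup` is a hypothetical single-process class, message identity is reduced to a string key, and the caller passes the current time explicitly (with non-decreasing timestamps) so the behaviour is deterministic.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: reject a key only if it was already accepted within
// the last windowMillis milliseconds.
final class SlidingWindowDedup {
    private final long windowMillis;
    // Insertion order equals time order here, because a key is inserted only
    // when it is absent and timestamps are assumed to be non-decreasing.
    private final LinkedHashMap<String, Long> firstSeen = new LinkedHashMap<>();

    SlidingWindowDedup(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Returns true if the key is accepted, false if it is a duplicate
    // inside the window.
    boolean acceptIfNew(String key, long nowMillis) {
        evictExpired(nowMillis);
        if (firstSeen.containsKey(key)) {
            return false;
        }
        firstSeen.put(key, nowMillis);
        return true;
    }

    // Drops entries from the head until one is still inside the window.
    private void evictExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = firstSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() < windowMillis) {
                break;
            }
            it.remove();
        }
    }

    int size() {
        return firstSeen.size();
    }
}
```

Memory use is bounded by the number of distinct keys seen within one window rather than by the total number of messages, which is the trade-off the question is asking about.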
>>>>>>>>>>>>>> For now, these are the only questions. I have ideas that need
>>>>>>>>>>>>>> review about possible implementations but I will mention them once
>>>>>>>>>>>>>> the specifications are clear to me. As an end question, I would at
>>>>>>>>>>>>>> some point like to post design ideas to this problem privately to
>>>>>>>>>>>>>> get it reviewed by the development community but not make it
>>>>>>>>>>>>>> publicly available so that it cannot be plagiarised. What
>>>>>>>>>>>>>> platform/method can I use to do that? Or is submitting a draft to
>>>>>>>>>>>>>> the Google platform the only possible way to accomplish this?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks a lot for reading this through and looking forward to your
>>>>>>>>>>>>>> inputs.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Posted by Von Gosling <vo...@apache.org>.
Hi Sohaib,

I have reviewed your proposal and left some suggestions on the problems you raised.

To all other GSoC students: please consider following Sohaib's example. We look forward to your proposals on Google Docs.

Best Regards,
Von Gosling

> On Mar 9, 2018, at 15:17, Sohaib Iftikhar <so...@gmail.com> wrote:
> 
> Hi Yukon,
> 
> What do you suggest for the key store itself? Do you propose writing this
> ourselves or using some existing solution and writing a layer on top?
> 
> Thanks,
> Sohaib
> 
> On Fri, Mar 9, 2018 at 6:20 AM, yukon <yu...@apache.org> wrote:
> 
>> ```
>> Personally, I find RAFT to be much simpler to implement. However, I do not
>> expect to reinvent the wheel here.
>> ```
>> 
>> That's absolutely right, no need to reinvent the wheel, there are many
>> existing implementations for raft: https://raft.github.io/
>> 
>> ```
>> I don't think using key store to persist all the messages is a good idea.
>> ```
>> 
>> Yes, storing an ID is enough.
>> 
>> 
>> On Thu, Mar 8, 2018 at 3:32 PM, Sohaib Iftikhar <so...@gmail.com>
>> wrote:
>> 
>>> Hi Dexin,
>>> 
>>> Thank you for your suggestions. I will try to answer as much as I can and
>>> leave the rest to the RocketMQ team.
>>> 
>>> 1. The idea with incremental IDs is actually quite good. But @Yukon
>>> mentioned that duplication can also be controlled by an application
>>> (special KV property), in which case different producers may produce the
>>> same message that needs to be deduplicated on the broker.
>>> SessionId+IncrementalId won't work in this scenario, I believe. But we can
>>> actually switch to more efficient storage using the idea you described
>>> when the user is not specifying these special keys.
>>> Also, I proposed storing keys for only a fixed time interval. For all
>>> practical purposes this would still remain constant time. [Log base 2 of
>>> 10^10 is still just 33 :) ]. It does add the extra cost of communication
>>> but this would be the case in both scenarios.
>>> 2. As for consensus, the ideas I presented were pretty abstract so I
>>> mentioned a couple of algorithms that could potentially be used.
>>> Personally, I find RAFT to be much simpler to implement. However, I do not
>>> expect to reinvent the wheel here. I strongly believe that in this case, we
>>> can build upon some tested existing solution.
>>> 
>>> 
>>> Regards,
>>> Sohaib
>>> 
>>> On Thu, Mar 8, 2018 at 1:31 AM, 李 德鑫 <de...@outlook.com> wrote:
>>> 
>>>> Hi Sohaib,
>>>> 
>>>> 
>>>> I'm a student applying for GSoC too, and I've read all of your discussion
>>>> on the mailing list.
>>>> 
>>>> I have some questions about your design, and some of them may need to be
>>>> answered by the RocketMQ team, so I am sending them here to be discussed.
>>>> 
>>>> I don't think using a key store to persist all the messages is a good idea.
>>>> Since the MQ is based on O(1) data structures, the key store would harm
>>>> performance.
>>>> 
>>>> I think we can learn from the TCP protocol.
>>>> 
>>>> In producer-broker communication, we can assign an incremental ID to every
>>>> message sent in the same session, and the producer should persist the
>>>> session ID on disk. The broker then only needs to maintain a map from
>>>> session ID to expected message ID (this is how Kafka does it), since there
>>>> are far more messages than producers. However, a K/V store is still
>>>> needed, so we have to ask the RocketMQ team how many producers are
>>>> typically connected at the same time in practice.
>>>> 
>>>> The same idea applies to consumer-broker communication.
>>>> 
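The session ID plus incremental ID scheme can be sketched in Java as follows. This is a hypothetical in-memory illustration rather than an existing RocketMQ or Kafka API: `SessionSequenceTracker`, its `Result` values, and the choice to start sequences at 0 are assumptions, and a real broker would have to persist the map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the broker keeps one "next expected sequence number"
// per producer session instead of one entry per message.
final class SessionSequenceTracker {
    enum Result { ACCEPTED, DUPLICATE, OUT_OF_ORDER }

    // Assumption: an in-memory map standing in for the small K/V store.
    private final Map<String, Long> nextExpected = new HashMap<>();

    Result offer(String sessionId, long sequence) {
        long expected = nextExpected.getOrDefault(sessionId, 0L);
        if (sequence < expected) {
            return Result.DUPLICATE;      // already stored, e.g. a retry
        }
        if (sequence > expected) {
            return Result.OUT_OF_ORDER;   // a gap: an earlier send is missing
        }
        nextExpected.put(sessionId, expected + 1);
        return Result.ACCEPTED;
    }
}
```

The state grows with the number of sessions rather than the number of messages. As noted elsewhere in the thread, this only detects retries within one session; it does not catch the same business message sent by two different producers.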
>>>> 
>>>> About the consensus algorithm, I think RocketMQ should already have an
>>>> implementation. I don't know what it is, but maybe you can reuse it. Or,
>>>> if you have to implement one, in my opinion there's no need to implement
>>>> both Paxos and Raft, since they solve the same kind of problem.
>>>> 
>>>> 
>>>> 
>>>> Regards,
>>>> 
>>>> Dexin
>>>> 
>>>> 
>>>> ________________________________
>>>> From: Sohaib Iftikhar <so...@gmail.com>
>>>> Sent: March 7, 2018 18:15:51
>>>> To: dev@rocketmq.apache.org
>>>> Subject: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery
>>>> mechanism
>>>> 
>>>> Hi Yukon,
>>>> 
>>>> Thanks for your reply. Yes, it would be nice to concretely define the scope
>>>> of this project as the doc is a bit ambitious for just a summer. Should you
>>>> (or anyone else) have questions/suggestions/clarifications I'd be glad to
>>>> discuss more details.
>>>> 
>>>> Thanks,
>>>> Sohaib
>>>> 
>>>> On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> A Google doc is better for discussion. Your design is great; now we can
>>>>> discuss more details based on it.
>>>>> 
>>>>> Any advice is welcome from RocketMQ community.
>>>>> 
>>>>> Appreciate your efforts.
>>>>> 
>>>>> Regards,
>>>>> yukon
>>>>> 
>>>>> On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <sohaib1692@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> @Yukon Thank you for your reply. This clears some doubts.
>>>>>> 
>>>>>> Sorry for the delay as I was somewhat occupied with another project. I
>>>>>> have created an initial design doc. Since email is a bit cumbersome for
>>>>>> feedback, I wrote this document in two formats:
>>>>>> 
>>>>>> 1. In the form of a Google document:
>>>>>> https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24
>>>>>> The document is open for comments to all users without signing in. I would
>>>>>> appreciate it if you put your name before the comment so I can identify
>>>>>> who to follow up the discussion with.
>>>>>> 
>>>>>> 2. As a markdown file on GitHub:
>>>>>> https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md
>>>>>> The comments for this can be made on the commit:
>>>>>> https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
>>>>>> 
>>>>>> After I have received a certain amount of feedback I will try to
>>>>>> incorporate it and put in a subsequent version for review. Please tell me
>>>>>> which method suits you better (gdoc or github) for review and we can
>>>>>> continue with that for the subsequent versions.
>>>>>> 
>>>>>> Lastly, the document is a couple of pages so I appreciate your patience
>>>>>> and your help.
>>>>>> Looking forward to your opinions.
>>>>>> 
>>>>>> Thanks,
>>>>>> Sohaib
>>>>>> 
>>>>>> On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
>>>>>> 
>>>>>>> Hi Sohaib,
>>>>>>> 
>>>>>>> Sorry for the late reply, we could move this project forward now ~
>>>>>>> 
>>>>>>> ```
>>>>>>> I would at some point like to post
>>>>>>> design ideas to this problem privately to get it reviewed by the
>>>>>>> development community but not make it publicly available so that it
>>>>>>> cannot be plagiarised.
>>>>>>> ```
>>>>>>> 
>>>>>>> You can send your design ideas to me directly or to our PMC list
>>>>>>> (private@rocketmq.apache.org) if you want to keep your ideas private.
>>>>>>> But please don't break away from the community.
>>>>>>> 
>>>>>>> I hope you have already understood the goal of this project. RocketMQ
>>>>>>> currently supports at-least-once delivery, so removing duplicated
>>>>>>> messages is an obvious way to achieve exactly-once delivery.
>>>>>>> 
>>>>>>> Return to your original questions:
>>>>>>> 
>>>>>>> 1. What defines a redundant message?
>>>>>>> 
>>>>>>> A message ID is generated when a message is created, so this ID can be
>>>>>>> used to identify a message. Also, the user could specify a unique
>>>>>>> business-related property to identify a message.
>>>>>>> 
>>>>>>> Redundant messages occur when the network is broken or reconnected,
>>>>>>> when a rebalance[1] is triggered, etc.
>>>>>>> 
>>>>>>> 
>>>>>>> 2. Is there a timeline on the redundant messages?
>>>>>>> 
>>>>>>> Yes. Keeping all messages non-redundant is expensive, so let's consider
>>>>>>> this question within a certain time window ~
>>>>>>> 
>>>>>>> Looking forward to your design.
>>>>>>> 
>>>>>>> [1].
>>>>>>> https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
>>>>>>> 
>>>>>>> 
>>>>>>> Regards,
>>>>>>> yukon
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <sohaib1692@gmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> @Zhanhui Thanks for the response. This is not a campaign, it's just part
>>>>>>>> of GSoC (https://summerofcode.withgoogle.com/). And community help is
>>>>>>>> gladly welcomed. In fact, it is recommended :)
>>>>>>>> 
>>>>>>>> @KaiYuan Thanks for your suggestions. I will come up with a flow chart
>>>>>>>> for the proposed solution this weekend.
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Sohaib
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi Sohaib,
>>>>>>>>> 
>>>>>>>>> I have been rather busy these days. Sorry for replying so late!
>>>>>>>>> 
>>>>>>>>> I am not sure what "deadline" you are referring to. If this is part of a
>>>>>>>>> campaign, I have to admit I am not aware of the regulations or what kind
>>>>>>>>> of help I should offer to maintain fairness, considering other similar
>>>>>>>>> issues that may arise.
>>>>>>>>> 
>>>>>>>>> Regards!
>>>>>>>>> 
>>>>>>>>> Zhanhui Li
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi guys,
>>>>>>>>>> 
>>>>>>>>>> Would be nice to have some feedback on this as the deadline is not too
>>>>>>>>>> far :)
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> Sohaib
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>> 
>>>>>>>>>> -- Man is still the most extraordinary computer of all.--
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1692@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> Thank you for the pointers to the code. This was super helpful. The
>>>>>>>>>> multiple keys can probably be serialized better than separating them
>>>>>>>>>> with a space but that is already legacy I suppose.
>>>>>>>>>> 
>>>>>>>>>> Firstly, filters like bloom or cuckoo are heuristic. They can help
>>>>>>>>>> make things faster but definitely cannot be used as the only solution.
>>>>>>>>>> Hence, in the end, we will still need a persistent keystore/distributed
>>>>>>>>>> set. My plan was to have this keystore as distributed (raft guarantee
>>>>>>>>>> etc.). The keystore can also hold a persistent filter on its end. If a
>>>>>>>>>> broker collapses it can renew/refresh its filter from the keystore,
>>>>>>>>>> hence eliminating the problems about crashes that you mention. The
>>>>>>>>>> problem here could be in maintaining performance for filters in case of
>>>>>>>>>> removals from the keystore (e.g. sliding windows as mentioned in my
>>>>>>>>>> previous mail). Periodically refreshing the filters can help solve this
>>>>>>>>>> but I am open to suggestions on how to make this better.
>>>>>>>>>> 
>>>>>>>>>> I think implementing a distributed set on the client cluster has its
>>>>>>>>>> caveats. The way I understand RocketMQ is that we do not have control
>>>>>>>>>> over the disk space/memory on the client end, so we probably only have
>>>>>>>>>> a constant amount. A distributed set on the client would also need to
>>>>>>>>>> be persistent, e.g. if a client restarts/recovers. This basically
>>>>>>>>>> means we need a keystore on the client instead of the broker cluster.
>>>>>>>>>> This probably puts too much responsibility on the client cluster. A
>>>>>>>>>> different approach would be to ensure that the offsets are always in
>>>>>>>>>> sync with the broker. Since the broker only serves unique messages
>>>>>>>>>> (based on the proposed solution on the producer/broker end) all we
>>>>>>>>>> need to ensure is that a client does not consume messages with the
>>>>>>>>>> same offset twice.
>>>>>>>>>> 
>>>>>>>>>> Please suggest improvements if this does not look like the correct
>>>>>>>>>> approach. It would also be great if someone can come up with a
>>>>>>>>>> completely different approach so that we can weigh up pros and cons.
>>>>>>>>>> 
>>>>>>>>>> Thanks for reading this through and looking forward to your opinions.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Sohaib
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>> 
>>>>>>>>>> -- Man is still the most extraordinary computer of all.--
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
>>>>>>>>>> Hi Sohaib,
>>>>>>>>>> 
>>>>>>>>>> About multiple key support, the following code pointers should clarify
>>>>>>>>>> your doubt:
>>>>>>>>>> The org.apache.rocketmq.common.message.Message class has overloaded
>>>>>>>>>> setKeys methods, allowing you to set multiple keys via a string
>>>>>>>>>> (separated by spaces; sorry, we have not yet unified all separators,
>>>>>>>>>> hoping this does not confuse you) or a collection.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> When the broker builds the index for a message with multiple keys,
>>>>>>>>>> multiple index entries are inserted into the index file.
>>>>>>>>>> See org.apache.rocketmq.store.index.IndexService#buildIndex
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> In terms of eliminating message duplication, personally, I wish we had
>>>>>>>>>> an exactly-once semantic covering the whole cluster and the complete
>>>>>>>>>> send-store-consume process. A rough idea is to route each message to a
>>>>>>>>>> broker by applying a rule to its unique key; the serving broker then
>>>>>>>>>> ensures uniqueness of the message according to the key (as you said,
>>>>>>>>>> bloom-filter/cuckoo-filter, etc). Things might look simple, but issues
>>>>>>>>>> reside in scenarios where the cluster is experiencing membership
>>>>>>>>>> changes. For example, what if a broker crashes? We might need to
>>>>>>>>>> propagate the bloom-filter bitset synchronously to other brokers
>>>>>>>>>> having the same topics. What if a new broker joins the cluster and
>>>>>>>>>> starts to serve? I do not mean this is too complex to implement.
>>>>>>>>>> Instead, this is a pretty interesting topic and fancy feature to have.
>>>>>>>>>> Alternatively, we might defer eliminating duplicates to the
>>>>>>>>>> consumption phase using a kind of distributed set. For sure, this
>>>>>>>>>> alternative suffers from the same challenges, including membership
>>>>>>>>>> changes.
>>>>>>>>>> 
>>>>>>>>>> Guys of dev board, any insights on this issue?
>>>>>>>>>> 
>>>>>>>>>> Zhanhui Li
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Zhanhui,
>>>>>>>>>>> 
>>>>>>>>>>> I have a doubt about these multiple keys. If I am wrong in any of the
>>>>>>>>>>> assumptions I make please point it out.
>>>>>>>>>>> 
>>>>>>>>>>> If there is support for multiple keys I cannot see this in the code.
>>>>>>>>>>> The class Message only stores a single key in the property map against
>>>>>>>>>>> the property name "KEYS". Is this also done in the same way as tags?
>>>>>>>>>>> That is, different keys are separated with ' || '? So basically as a
>>>>>>>>>>> user of the producer API it is the user's responsibility to ensure
>>>>>>>>>>> that he separates the different keys with the correct separator. I can
>>>>>>>>>>> see an obvious problem here. What if the key contains this special
>>>>>>>>>>> character ' || '? But maybe this event is rare and hence this is not
>>>>>>>>>>> important. Could you point me to some source/doc that explains this
>>>>>>>>>>> part? I was looking at the index section of rocketmq-store but I have
>>>>>>>>>>> not been able to understand the indexing process completely for now. I
>>>>>>>>>>> will keep reading the source to get a better idea.
>>>>>>>>>>> 
>>>>>>>>>>> Moving on to the implementation details, here is a broad idea of
>>>>>>>>>>> one possible approach.
>>>>>>>>>>>
>>>>>>>>>>> The goal is to remove duplicate messages. In this issue, I would
>>>>>>>>>>> like to aim at eliminating duplicates at the producer/broker end.
>>>>>>>>>>> For now, we do not concern ourselves with duplicates caused by
>>>>>>>>>>> unwritten consumer offsets, as the two issues have different
>>>>>>>>>>> solutions. One way to solve the problem at the producer/broker end
>>>>>>>>>>> is a distributed key store that stores the messages. We can make
>>>>>>>>>>> it configurable so that the store either keeps all messages or
>>>>>>>>>>> works as a sliding window, keeping only the messages from the last
>>>>>>>>>>> X seconds as specified by the user. We can put a set-membership
>>>>>>>>>>> layer on top, such as a Bloom filter or a cuckoo filter
>>>>>>>>>>> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf), to
>>>>>>>>>>> help performance. Every message pushed in by a producer is checked
>>>>>>>>>>> first against the filter and, in case of a positive result, against
>>>>>>>>>>> the key store. If the message is found, it is discarded. This
>>>>>>>>>>> removes duplicates completely from a producer perspective. The
>>>>>>>>>>> core of this idea is the distributed key store, which would be
>>>>>>>>>>> completely separate from the current message storage. Since a
>>>>>>>>>>> distributed key store or key/value store is not a novel concept,
>>>>>>>>>>> there are two ways to do this:
>>>>>>>>>>> 1. Implement it ourselves. This would be high effort but would add
>>>>>>>>>>> no external dependencies.
>>>>>>>>>>> 2. Use a key-value store such as Redis (which already has timeouts
>>>>>>>>>>> and persistence but a large memory footprint) or some other
>>>>>>>>>>> disk-based storage for set membership. This would add an external
>>>>>>>>>>> dependency, but development time would drop significantly.
>>>>>>>>>>> I am inclined towards implementing it myself, as this would avoid
>>>>>>>>>>> dependencies on other products, especially since RocketMQ is
>>>>>>>>>>> currently a self-reliant system. In addition, my past experience
>>>>>>>>>>> with building such a store should come in handy.
>>>>>>>>>>> 
>>>>>>>>>>> I would like to hear the development community's opinions on this
>>>>>>>>>>> approach and any suggested improvements. Looking forward to your
>>>>>>>>>>> responses.
>>>>>>>>>>>
>>>>>>>>>>> ====<question unrelated to issue>=====
>>>>>>>>>>> To increase my familiarity with the code base, and to help show
>>>>>>>>>>> that I am familiar with the tools and technologies in place, it
>>>>>>>>>>> would be great if I could be pointed to some low-effort issues
>>>>>>>>>>> that I could help out with. In case there are no 'newbie' issues
>>>>>>>>>>> available, I could help improve the comments inside the codebase.
>>>>>>>>>>> I noticed some source files with no explanations, which could be
>>>>>>>>>>> documented via comments to help onboard new contributors faster.
>>>>>>>>>>> ====</question unrelated to issue>=====
>>>>>>>>>>>
>>>>>>>>>>> Thanks a lot for reading this through and looking forward to your
>>>>>>>>>>> opinions.
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>> Sohaib
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <lizhanhui@gmail.com
>>>>>>>>>>> <ma...@gmail.com>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Sohaib,
>>>>>>>>>>>> 
>>>>>>>>>>>> Happy to know you are interested in RocketMQ.
>>>>>>>>>>>> 
>>>>>>>>>>>> First, let me answer questions you raised.
>>>>>>>>>>>> 
>>>>>>>>>>>> — can there be multiple tags?
>>>>>>>>>>>> No. At present, the storage engine allows only a single tag per
>>>>>>>>>>>> message. Subscriptions are allowed to use a combination of tags.
>>>>>>>>>>>> The current model should meet your business development needs. If
>>>>>>>>>>>> not, please let us know.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> — key (Similar question to above.)
>>>>>>>>>>>> RocketMQ builds its index using message keys. A single message may
>>>>>>>>>>>> have multiple keys.
>>>>>>>>>>>>
>>>>>>>>>>>> — About redundant message
>>>>>>>>>>>> From my understanding, you are trying to eliminate duplicate
>>>>>>>>>>>> messages. True, there are various causes of message duplication,
>>>>>>>>>>>> ranging from message delivery to consumption. Discussion on this
>>>>>>>>>>>> topic is warmly welcome. If you have any ideas to contribute on
>>>>>>>>>>>> this issue, the developer board is happy to discuss them.
>>>>>>>>>>>> 
>>>>>>>>>>>> Zhanhui Li
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com
>>>>>>>>>>>>> <mailto:sohaib1692@gmail.com>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> My earlier email message seems to have gotten lost, so I will try
>>>>>>>>>>>>> again. Please see the original message for the discussion.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -- Man is still the most extraordinary computer of
>> all.--
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar
>>>>>>>>>>>>> <sohaib1692@gmail.com <ma...@gmail.com>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am interested in working on this issue
>>>>>>>>>>>>>> (https://issues.apache.org/jira/browse/ROCKETMQ-124) as part of
>>>>>>>>>>>>>> GSOC-18. I have a few questions about it. I am not sure if this
>>>>>>>>>>>>>> discussion needs to be on the JIRA issue or here. Feel free to
>>>>>>>>>>>>>> correct me if this is the wrong platform. Also, while I have
>>>>>>>>>>>>>> worked with distributed pub-sub systems, I am still fairly new to
>>>>>>>>>>>>>> RocketMQ, so maybe my understanding of it is incorrect. I
>>>>>>>>>>>>>> apologise if that is the case and would be happy to stand
>>>>>>>>>>>>>> corrected.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Following are my questions:
>>>>>>>>>>>>>> 1. What defines a redundant message?
>>>>>>>>>>>>>>  The constructor that I see for a message is as follows:
>>>>>>>>>>>>>>  Message(String topic, String tags, String keys, int flag,
>>>>>>>>>>>>>>  byte[] body, boolean waitStoreMsgOK)
>>>>>>>>>>>>>>  Possible candidates to me are topic, tags (can there be
>>>>>>>>>>>>>>  multiple tags? I could not find an example for this. If yes,
>>>>>>>>>>>>>>  how are they separated?), keys (similar question to above) and
>>>>>>>>>>>>>>  of course the body. Is there something that I have missed in
>>>>>>>>>>>>>>  this? Is there something that we do not need to consider?
>>>>>>>>>>>>>> 2. Is there a timeline on the redundant messages? What I mean is:
>>>>>>>>>>>>>> is there a time limit after which a message with similar content
>>>>>>>>>>>>>> is allowed? From what I gather, no such thing was mentioned. This
>>>>>>>>>>>>>> would mean storing all the messages. Depending on the
>>>>>>>>>>>>>> requirements, this may or may not be the best solution. It might
>>>>>>>>>>>>>> be desirable that no duplicates occur within a certain (sliding)
>>>>>>>>>>>>>> time window. This allows ignoring duplicate messages that were
>>>>>>>>>>>>>> generated very close to each other (or within the window
>>>>>>>>>>>>>> indicated). Depending on this requirement, the implementation may
>>>>>>>>>>>>>> become a little more involved.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For now, these are the only questions. I have ideas about
>>>>>>>>>>>>>> possible implementations that need review, but I will mention
>>>>>>>>>>>>>> them once the specifications are clear to me. As an end question,
>>>>>>>>>>>>>> I would at some point like to post design ideas to this problem
>>>>>>>>>>>>>> privately to get it reviewed by the development community but not
>>>>>>>>>>>>>> make it publicly available so that it cannot be plagiarised. What
>>>>>>>>>>>>>> platform/method can I use to do that? Or is submitting a draft to
>>>>>>>>>>>>>> the Google platform the only possible way to accomplish this?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>> Sohaib Iftikhar
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: 答复: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Posted by Sohaib Iftikhar <so...@gmail.com>.
Hi Yukon,

What do you suggest for the key store itself? Do you propose writing this
ourselves or using some existing solution and writing a layer on top?

Thanks,
Sohaib

On Fri, Mar 9, 2018 at 6:20 AM, yukon <yu...@apache.org> wrote:

> ```
> Personally, I find RAFT to be much simpler to implement. However, I do not
> expect to reinvent the wheel here.
> ```
>
> That's absolutely right, no need to reinvent the wheel, there are many
> existing implementations for raft: https://raft.github.io/
>
> ```
> I don't think using key store to persist all the messages is a good idea.
> ```
>
> Yes, storing an ID is enough.
>
>
> On Thu, Mar 8, 2018 at 3:32 PM, Sohaib Iftikhar <so...@gmail.com>
> wrote:
>
> > Hi Dexin,
> >
> > Thank you for your suggestions. I will try to answer as much as I can and
> > leave the rest to the RocketMQ team.
> >
> > 1. The idea with incremental IDs is actually quite good. But @Yukon
> > mentioned that duplication can also be controlled by an application
> > (special KV property), in which case different producers may produce the
> > same message, which then needs to be deduplicated on the broker.
> > SessionId+IncrementalId won't work in this scenario, I believe. But we can
> > switch to the more efficient storage you described when the user is not
> > specifying these special keys.
> > Also, I proposed storing keys for only a fixed time interval. For all
> > practical purposes the lookup cost would still be effectively constant,
> > since even a logarithmic-time lookup stays cheap [log base 2 of 10^10 is
> > still just about 33 :) ]. It does add the extra cost of communication, but
> > this would be the case in both scenarios.
> > 2. As for consensus, the ideas I presented were pretty abstract, so I
> > mentioned a couple of algorithms that could potentially be used.
> > Personally, I find RAFT to be much simpler to implement. However, I do not
> > expect to reinvent the wheel here. I strongly believe that in this case we
> > can build upon some tested existing solution.
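
A minimal sketch of the fixed-time-interval key storage mentioned in point 1, assuming message keys are plain strings and timestamps are supplied by the caller in non-decreasing order. The class and method names (TimeWindowDedupStore, firstSeen) are hypothetical and not part of RocketMQ; a real store would also need persistence and replication, which are left out here.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Remembers message keys for a fixed time window (hypothetical helper, in-memory only). */
class TimeWindowDedupStore {
    private final long windowMillis;
    // Insertion-ordered map: key -> time first seen. Oldest entries come first.
    private final LinkedHashMap<String, Long> seen = new LinkedHashMap<>();

    TimeWindowDedupStore(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the key is new within the window, false if it is a duplicate. */
    synchronized boolean firstSeen(String key, long nowMillis) {
        evictExpired(nowMillis);
        if (seen.containsKey(key)) {
            return false;
        }
        seen.put(key, nowMillis);
        return true;
    }

    /** Drops entries that are at least windowMillis old; stops at the first fresh one. */
    private void evictExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = seen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() < windowMillis) {
                break; // all remaining entries were inserted later
            }
            it.remove();
        }
    }

    synchronized int size() {
        return seen.size();
    }
}
```

With a hash-based map the membership check is expected constant time, and the window bounds how many keys are held at once.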
> >
> >
> > Regards,
> > Sohaib
> >
> > On Thu, Mar 8, 2018 at 1:31 AM, 李 德鑫 <de...@outlook.com> wrote:
> >
> > > Hi Sohaib,
> > >
> > >
> > > I'm a student applying for GSOC too, and I've read all of your discussion
> > > on the mailing list.
> > >
> > > I have some questions about your design, and some of them may need to be
> > > answered by the RocketMQ team, so I am sending them here for discussion.
> > >
> > > I don't think using key store to persist all the messages is a good idea.
> > > Since the MQ is built on O(1) data structures, the key store would harm
> > > performance.
> > >
> > > I think we can learn from the TCP protocol.
> > >
> > > In producer-broker communication, we can assign an incremental ID to every
> > > message sent in the same session, and the producer should persist the
> > > session ID on disk. The broker then only needs to maintain a map from
> > > session ID to expected message ID (this is how Kafka does it), which is
> > > much smaller than a store of all messages because there are far fewer
> > > producers than messages. However, a K/V store is still needed, so we have
> > > to ask the RocketMQ team how many producers are typically active at the
> > > same time in practice.
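
The session-ID idea above can be sketched as follows. This is only an illustration under stated assumptions: sequence numbers start at 0 for each session, and the names SessionSequenceTracker and offer are hypothetical, not an existing RocketMQ or Kafka API. The map lives in memory here; the proposal would need it to be persisted.

```java
import java.util.HashMap;
import java.util.Map;

/** Broker-side map from producer session ID to the next expected sequence number. */
class SessionSequenceTracker {
    enum Result { ACCEPTED, DUPLICATE, OUT_OF_ORDER }

    private final Map<String, Long> nextExpected = new HashMap<>();

    /** Classifies an incoming message by comparing its sequence with the expected one. */
    synchronized Result offer(String sessionId, long sequence) {
        long expected = nextExpected.getOrDefault(sessionId, 0L);
        if (sequence < expected) {
            return Result.DUPLICATE;     // already stored, e.g. a retry after a lost ack
        }
        if (sequence > expected) {
            return Result.OUT_OF_ORDER;  // a gap: an earlier message has not arrived yet
        }
        nextExpected.put(sessionId, expected + 1);
        return Result.ACCEPTED;
    }
}
```

The broker keeps one entry per session rather than one per message, which is the point of the proposal.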
> > >
> > > The same idea applies to consumer-broker communication.
> > >
> > >
> > > About the consensus algorithm, I think RocketMQ should already have an
> > > implementation. I don't know what it is, but maybe you can reuse it. And
> > > if you do have to implement one, in my opinion there's no need to
> > > implement both Paxos and Raft, since they solve the same kind of
> > > problems.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Dexin
> > >
> > >
> > > ________________________________
> > > From: Sohaib Iftikhar <so...@gmail.com>
> > > Sent: March 7, 2018 18:15:51
> > > To: dev@rocketmq.apache.org
> > > Subject: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery
> > > mechanism
> > >
> > > Hi Yukon,
> > >
> > > Thanks for your reply. Yes, it would be nice to concretely define the
> > > scope of this project, as the doc is a bit ambitious for just a summer.
> > > Should you (or anyone else) have questions/suggestions/clarifications,
> > > I'd be glad to discuss more details.
> > >
> > > Thanks,
> > > Sohaib
> > >
> > > On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:
> > >
> > > > Hi,
> > > >
> > > > A Google doc is better for discussion. Your design is great; now we
> > > > can discuss more details based on it.
> > > >
> > > > Any advice is welcome from RocketMQ community.
> > > >
> > > > Appreciate your efforts.
> > > >
> > > > Regards,
> > > > yukon
> > > >
> > > > On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <sohaib1692@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > @Yukon Thank you for your reply. This clears some doubts.
> > > > >
> > > > > Sorry for the delay; I was somewhat occupied with another project. I
> > > > > have created an initial design doc. Since email is a bit cumbersome
> > > > > for feedback, I wrote this document in two formats:
> > > > >
> > > > > 1. In the form of a Google document:
> > > > > https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24
> > > > > The document is open for comments to all users without signing in. I
> > > > > would appreciate it if you put your name before your comment so I can
> > > > > identify who to follow up the discussion with.
> > > > >
> > > > > 2. As a markdown file on GitHub:
> > > > > https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md
> > > > > The comments for this can be made on the commit:
> > > > > https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
> > > > >
> > > > > After I have received a certain amount of feedback, I will try to
> > > > > incorporate it and put up a subsequent version for review. Please tell
> > > > > me which method suits you better (gdoc or GitHub) for review, and we
> > > > > can continue with that for the subsequent versions.
> > > > >
> > > > > Lastly, the document is a couple of pages long, so I appreciate your
> > > > > patience and your help.
> > > > > Looking forward to your opinions.
> > > > >
> > > > > Thanks,
> > > > > Sohaib
> > > > >
> > > > > On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
> > > > >
> > > > > > Hi Sohaib,
> > > > > >
> > > > > > Sorry for the late reply, we could move this project forward now ~
> > > > > >
> > > > > > ```
> > > > > > I would at some point like to post
> > > > > > design ideas to this problem privately to get it reviewed by the
> > > > > > development community but not make it publicly available so that it
> > > > > > cannot be plagiarised.
> > > > > > ```
> > > > > >
> > > > > > You can send your design ideas to me directly or to our PMC list
> > > > > > (private@rocketmq.apache.org) if you want to keep your ideas private.
> > > > > > But please don't break away from the community.
> > > > > >
> > > > > > I hope you have already understood the goal of this project. RocketMQ
> > > > > > currently supports at-least-once delivery, so an obvious way to
> > > > > > achieve exactly-once is to remove duplicated messages.
> > > > > >
> > > > > > Returning to your original questions:
> > > > > >
> > > > > > 1. What defines a redundant message?
> > > > > >
> > > > > > A message ID is generated when a message is created, so this ID can
> > > > > > be used to identify a message. The user can also specify a unique
> > > > > > business-related property to identify a message.
> > > > > >
> > > > > > Redundant messages occur when the network is broken or reconnected,
> > > > > > when a rebalance[1] is triggered, etc.
> > > > > >
> > > > > >
> > > > > > 2. Is there a timeline on the redundant messages?
> > > > > >
> > > > > > Yes. Keeping all messages non-redundant is expensive, so let's
> > > > > > consider this question within a certain time window ~
> > > > > >
> > > > > > Looking forward to your design.
> > > > > >
> > > > > > [1].
> > > > > > https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > > yukon
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <
> > > sohaib1692@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > @Zhanhui Thanks for the response. This is not a campaign, it's just
> > > > > > > part of GSoC (https://summerofcode.withgoogle.com/). And community
> > > > > > > help is gladly welcomed. In fact, it is recommended :)
> > > > > > >
> > > > > > > @KaiYuan Thanks for your suggestions. I will come up with a flow
> > > > > > > chart for the proposed solution this weekend.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sohaib
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <
> lizhanhui@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Hi Sohaib,
> > > > > > > >
> > > > > > > > I have been sort of busy these days. Sorry to reply so late!
> > > > > > > >
> > > > > > > > Not so sure what “deadline” you are referring to. If this is part
> > > > > > > > of a campaign, I have to admit I am not aware of the regulations
> > > > > > > > or of what kind of help I should offer to maintain fairness,
> > > > > > > > considering other similar issues that may arise.
> > > > > > > >
> > > > > > > > Regards!
> > > > > > > >
> > > > > > > > Zhanhui Li
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi guys,
> > > > > > > > >
> > > > > > > > > > Would be nice to have some feedback on this as the deadline
> > > > > > > > > > is not too far :)
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Sohaib
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Sohaib Iftikhar
> > > > > > > > >
> > > > > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar
> > > > > > > > > > <sohaib1692@gmail.com <ma...@gmail.com>> wrote:
> > > > > > > > > > Thank you for the pointers to the code. This was super
> > > > > > > > > > helpful. The multiple keys could probably be serialized
> > > > > > > > > > better than by separating them with a space, but that is
> > > > > > > > > > already legacy I suppose.
> > > > > > > > >
> > > > > > > > > > Firstly, filters like Bloom or cuckoo are probabilistic: they
> > > > > > > > > > can help make things faster but cannot be used as the only
> > > > > > > > > > solution. Hence, in the end, we will still need a persistent
> > > > > > > > > > keystore/distributed set. My plan was to make this keystore
> > > > > > > > > > distributed (with Raft guarantees etc.). The keystore can
> > > > > > > > > > also hold a persistent filter on its end. If a broker
> > > > > > > > > > crashes, it can renew/refresh its filter from the keystore,
> > > > > > > > > > which eliminates the crash problems you mention. The problem
> > > > > > > > > > here could be maintaining filter performance when entries are
> > > > > > > > > > removed from the keystore (e.g. sliding windows as mentioned
> > > > > > > > > > in my previous mail). Periodically refreshing the filters can
> > > > > > > > > > help solve this, but I am open to suggestions on how to make
> > > > > > > > > > this better.
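
A rough sketch of the filter-in-front-of-keystore check described above. Everything here is illustrative: SimpleBloomFilter and FilteredKeyStore are hypothetical names, the in-memory HashSet stands in for the persistent distributed keystore, and the filter size and hash functions are arbitrary choices rather than tuned values.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.HashSet;
import java.util.Set;

/** Tiny Bloom filter using double hashing; parameters are illustrative only. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // FNV-1a over the UTF-8 bytes, used as the second hash function.
    private static int fnv1a(String key) {
        int h = 0x811c9dc5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    private int index(String key, int i) {
        return Math.floorMod(key.hashCode() + i * fnv1a(key), size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means "definitely not seen"; true means "possibly seen". */
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}

/** Consults the filter first and only falls back to the key store on a positive. */
class FilteredKeyStore {
    private final SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
    private final Set<String> store = new HashSet<>(); // stand-in for the persistent keystore
    private int storeLookups = 0;

    /** Returns true if the key was not seen before and has now been recorded. */
    synchronized boolean recordIfNew(String key) {
        if (filter.mightContain(key)) {
            storeLookups++;
            if (store.contains(key)) {
                return false; // confirmed duplicate
            }
        }
        filter.add(key);
        store.add(key);
        return true;
    }

    synchronized int storeLookups() {
        return storeLookups;
    }
}
```

A plain Bloom filter cannot delete entries, which is why the sliding-window case needs either a periodic rebuild from the keystore, as suggested above, or a filter that supports deletion such as the cuckoo filter.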
> > > > > > > > >
> > > > > > > > > > I think implementing a distributed set on the client cluster
> > > > > > > > > > has its caveats. The way I understand RocketMQ, we do not
> > > > > > > > > > have control over the disk space/memory on the client end, so
> > > > > > > > > > we can probably rely on only a constant amount. A distributed
> > > > > > > > > > set on the client would also need to be persistent, e.g. in
> > > > > > > > > > case a client restarts or recovers. This basically means we
> > > > > > > > > > need a keystore on the client instead of on the broker
> > > > > > > > > > cluster, which probably puts too much responsibility on the
> > > > > > > > > > client cluster. A different approach would be to ensure that
> > > > > > > > > > the offsets are always in sync with the broker. Since the
> > > > > > > > > > broker only serves unique messages (based on the proposed
> > > > > > > > > > solution on the producer/broker end), all we need to ensure
> > > > > > > > > > is that a client does not consume messages with the same
> > > > > > > > > > offset twice.
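
The offset-based alternative can be sketched as a guard that remembers the highest offset processed per queue. This assumes offsets within a queue are delivered in increasing order; ConsumerOffsetGuard is a hypothetical name, and the in-memory map would have to be kept in sync with the broker (or persisted) to survive a client restart.

```java
import java.util.HashMap;
import java.util.Map;

/** Skips messages whose offset was already processed for the same queue. */
class ConsumerOffsetGuard {
    private final Map<Integer, Long> lastConsumed = new HashMap<>();

    /** Returns true if the message at this offset should be processed. */
    synchronized boolean shouldProcess(int queueId, long offset) {
        Long last = lastConsumed.get(queueId);
        if (last != null && offset <= last) {
            return false; // redelivery of an offset that was already consumed
        }
        lastConsumed.put(queueId, offset);
        return true;
    }
}
```

The state is one number per queue, so it stays constant-size on the client, unlike a client-side keystore.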
> > > > > > > > >
> > > > > > > > > > Please suggest improvements if this does not look like the
> > > > > > > > > > correct approach. It would also be great if someone could
> > > > > > > > > > come up with a completely different approach so that we can
> > > > > > > > > > weigh up the pros and cons.
> > > > > > > > > >
> > > > > > > > > > Thanks for reading this through and looking forward to your
> > > > > > > > > > opinions.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Sohaib
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Sohaib Iftikhar
> > > > > > > > >
> > > > > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li
> > > > > > > > > > <lizhanhui@gmail.com <ma...@gmail.com>> wrote:
> > > > > > > > > > Hi Sohaib,
> > > > > > > > > >
> > > > > > > > > > About multiple key support, the following pointers should
> > > > > > > > > > clarify your doubt:
> > > > > > > > > > The org.apache.rocketmq.common.message.Message class has
> > > > > > > > > > overloaded setKeys methods, allowing you to set multiple keys
> > > > > > > > > > via a string (separated by spaces; sorry, we have not yet
> > > > > > > > > > unified all separators, hopefully this does not confuse you)
> > > > > > > > > > or a collection.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > When the broker builds the index for a message with multiple
> > > > > > > > > > keys, multiple index entries are inserted into the index file.
> > > > > > > > > > See org.apache.rocketmq.store.index.IndexService#buildIndex
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > In terms of eliminating message duplication, personally, I
> > > > > > > > > > wish we had exactly-once semantics covering the whole cluster
> > > > > > > > > > and the complete send-store-consume process. A rough idea is
> > > > > > > > > > to route each message, based on its unique key, to a broker
> > > > > > > > > > chosen by some rule; the serving broker then ensures
> > > > > > > > > > uniqueness of the message by its key (as you said, with a
> > > > > > > > > > Bloom filter, cuckoo filter, etc.). Things might look simple,
> > > > > > > > > > but the issues reside in scenarios where the cluster is
> > > > > > > > > > experiencing membership changes. For example, what if a
> > > > > > > > > > broker crashes? We might need to propagate the Bloom filter
> > > > > > > > > > bitset synchronously to other brokers serving the same
> > > > > > > > > > topics. What if a new broker joins the cluster and starts to
> > > > > > > > > > serve? I do not mean this is too complex to implement.
> > > > > > > > > > Rather, this is a pretty interesting topic and a fancy
> > > > > > > > > > feature to have. Alternatively, we might defer eliminating
> > > > > > > > > > duplicates to the consumption phase, using some kind of
> > > > > > > > > > distributed set. For sure, the idea I am proposing suffers
> > > > > > > > > > from the same challenges, including membership changes.
> > > > > > > > >
> > > > > > > > > Guys of dev board, any insights on this issue?
> > > > > > > > >
> > > > > > > > > Zhanhui Li
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar
> > > > > > > > > >> <sohaib1692@gmail.com <mailto:sohaib1692@gmail.com>> wrote:
> > > > > > > > >>
> > > > > > > > >> Hi Zhanhui,
> > > > > > > > >>
> > > > > > > > > >> I have a doubt about these multiple keys. If any of my
> > > > > > > > > >> assumptions is wrong, please point it out.
> > > > > > > > > >>
> > > > > > > > > >> If there is support for multiple keys, I cannot see it in
> > > > > > > > > >> the code. The class Message only stores a single key in the
> > > > > > > > > >> property map against the property name "KEYS". Is this done
> > > > > > > > > >> the same way as for tags, that is, different keys are
> > > > > > > > > >> separated with ' || '? If so, a user of the producer API is
> > > > > > > > > >> responsible for separating the keys with the correct
> > > > > > > > > >> separator. I can see an obvious problem here: what if a key
> > > > > > > > > >> itself contains the separator ' || '? Maybe this is rare
> > > > > > > > > >> enough not to matter. Could you point me to some source/doc
> > > > > > > > > >> that explains this part? I was looking at the index section
> > > > > > > > > >> of rocketmq-store but have not yet fully understood the
> > > > > > > > > >> indexing process. I will keep reading the source to get a
> > > > > > > > > >> better idea.
> > > > > > > > >>
> > > > > > > > > >> Moving on to the implementation details, here is a broad
> > > > > > > > > >> idea of one possible approach.
> > > > > > > > > >>
> > > > > > > > > >> The goal is to remove duplicate messages. In this issue, I
> > > > > > > > > >> would like to aim at eliminating duplicates at the
> > > > > > > > > >> producer/broker end. For now, we do not concern ourselves
> > > > > > > > > >> with duplicates caused by unwritten consumer offsets, as the
> > > > > > > > > >> two issues have different solutions. One way to solve the
> > > > > > > > > >> problem at the producer/broker end is a distributed key
> > > > > > > > > >> store that stores the messages. We can make it configurable
> > > > > > > > > >> so that the store either keeps all messages or works as a
> > > > > > > > > >> sliding window, keeping only the messages from the last X
> > > > > > > > > >> seconds as specified by the user. We can put a
> > > > > > > > > >> set-membership layer on top, such as a Bloom filter or a
> > > > > > > > > >> cuckoo filter
> > > > > > > > > >> (https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf),
> > > > > > > > > >> to help performance. Every message pushed in by a producer
> > > > > > > > > >> is checked first against the filter and, in case of a
> > > > > > > > > >> positive result, against the key store. If the message is
> > > > > > > > > >> found, it is discarded. This removes duplicates completely
> > > > > > > > > >> from a producer perspective. The core of this idea is the
> > > > > > > > > >> distributed key store, which would be completely separate
> > > > > > > > > >> from the current message storage. Since a distributed key
> > > > > > > > > >> store or key/value store is not a novel concept, there are
> > > > > > > > > >> two ways to do this:
> > > > > > > > > >> 1. Implement it ourselves. This would be high effort but
> > > > > > > > > >> would add no external dependencies.
> > > > > > > > > >> 2. Use a key-value store such as Redis (which already has
> > > > > > > > > >> timeouts and persistence but a large memory footprint) or
> > > > > > > > > >> some other disk-based storage for set membership. This would
> > > > > > > > > >> add an external dependency, but development time would drop
> > > > > > > > > >> significantly.
> > > > > > > > > >> I am inclined towards implementing it myself, as this would
> > > > > > > > > >> avoid dependencies on other products, especially since
> > > > > > > > > >> RocketMQ is currently a self-reliant system. In addition, my
> > > > > > > > > >> past experience with building such a store should come in
> > > > > > > > > >> handy.
> > > > > > > > >>
> > > > > > > > > >> I would like to hear the development community's opinions on
> > > > > > > > > >> this approach and any suggested improvements. Looking
> > > > > > > > > >> forward to your responses.
> > > > > > > > > >>
> > > > > > > > > >> ====<question unrelated to issue>=====
> > > > > > > > > >> To increase my familiarity with the code base, and to help
> > > > > > > > > >> show that I am familiar with the tools and technologies in
> > > > > > > > > >> place, it would be great if I could be pointed to some
> > > > > > > > > >> low-effort issues that I could help out with. In case there
> > > > > > > > > >> are no 'newbie' issues available, I could help improve the
> > > > > > > > > >> comments inside the codebase. I noticed some source files
> > > > > > > > > >> with no explanations, which could be documented via comments
> > > > > > > > > >> to help onboard new contributors faster.
> > > > > > > > > >> ====</question unrelated to issue>=====
> > > > > > > > > >>
> > > > > > > > > >> Thanks a lot for reading this through and looking forward to
> > > > > > > > > >> your opinions.
> > > > > > > > >>
> > > > > > > > >> Regards,
> > > > > > > > >> Sohaib
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <
> > > > lizhanhui@gmail.com
> > > > > > > > <ma...@gmail.com>> wrote:
> > > > > > > > >>
> > > > > > > > >>> Hi Sohaib,
> > > > > > > > >>>
> > > > > > > > >>> Happy to know you are interested in RocketMQ.
> > > > > > > > >>>
> > > > > > > > >>> First, let me answer questions you raised.
> > > > > > > > >>>
> > > > > > > > >>> — can there be multiple tags?
> > > > > > > > >>> No. At present, the storage engine allows single tag
> only.
> > > > > > > > Subscriptions
> > > > > > > > >>> are allowed to use combination of tags. The current model
> > > > should
> > > > > > meet
> > > > > > > > your
> > > > > > > > >>> business development. If not, please let us know.
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>> — key (Similar question to above.)
> > > > > > > > >>> RocketMQ builds index using message keys. A single
> message
> > > may
> > > > > have
> > > > > > > > >>> multiple keys.
> > > > > > > > >>>
> > > > > > > > >>> — About redundant message
> > > > > > > > >>> From my understanding, you are trying to eliminate
> > duplicate
> > > > > > > messages.
> > > > > > > > >>> True there are various reasons which may cause message
> > > > > duplication,
> > > > > > > > ranging
> > > > > > > > >>> from message delivery and consumption. Discussion on this
> > > topic
> > > > > is
> > > > > > > > warmly
> > > > > > > > >>> welcome.  Had you had any idea to contribute on this
> issue,
> > > the
> > > > > > > > developer
> > > > > > > > >>> board is happy to discuss.
> > > > > > > > >>>
> > > > > > > > >>> Zhanhui Li
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>> My earlier email message seems to have gotten lost. So I
> > > will
> > > > > try
> > > > > > > > again.
> > > > > > > > >>>> Please see the original message for the discussion.
> > > > > > > > >>>>
> > > > > > > > >>>> Regards,
> > > > > > > > >>>> Sohaib Iftikhar
> > > > > > > > >>>>
> > > > > > > > >>>> -- Man is still the most extraordinary computer of
> all.--
> > > > > > > > >>>>
> > > > > > > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <
> > > > > > > > sohaib1692@gmail.com <ma...@gmail.com>>
> > > > > > > > >>>> wrote:
> > > > > > > > >>>>
> > > > > > > > >>>>> Hi,
> > > > > > > > >>>>>
> > > > > > > > >>>>> I am interested in working on this issue (
> > > > > > > https://issues.apache.org/
> > > > > > > > <https://issues.apache.org/>
> > > > > > > > >>>>> jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a
> > few
> > > > > > > questions
> > > > > > > > for
> > > > > > > > >>>>> the same. I am not sure if this discussion needs to be
> on
> > > the
> > > > > > JIRA
> > > > > > > > >>> issue or
> > > > > > > > >>>>> here. Feel free to correct me if this is the wrong
> > > platform.
> > > > > Also
> > > > > > > > while
> > > > > > > > >>> I
> > > > > > > > >>>>> have worked with distributed pub-sub systems I am still
> > > > fairly
> > > > > > new
> > > > > > > to
> > > > > > > > >>>>> Rocket-MQ so maybe my understanding of it is
> incorrect. I
> > > > > > apologise
> > > > > > > > if
> > > > > > > > >>> that
> > > > > > > > >>>>> is the case and would be happy to stand corrected.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Following are my questions:
> > > > > > > > >>>>> 1. What defines a redundant message?
> > > > > > > > >>>>>   The constructor that I see for a message is as
> follows:
> > > > > > > > >>>>>   Message(String topic, String tags, String keys, int
> > flag,
> > > > > > byte[]
> > > > > > > > >>> body,
> > > > > > > > >>>>> boolean waitStoreMsgOK)
> > > > > > > > >>>>>   Possible candidates to me are topic, tags (can there
> be
> > > > > > multiple
> > > > > > > > >>> tags?
> > > > > > > > >>>>> I could not find an example for this. If yes how are
> they
> > > > > > > > separated?),
> > > > > > > > >>> keys
> > > > > > > > >>>>> (Similar question to above.) and of course the body. Is
> > > there
> > > > > > > > something
> > > > > > > > >>>>> that I have missed in this? Is there something that we
> do
> > > not
> > > > > > need
> > > > > > > to
> > > > > > > > >>>>> consider?
> > > > > > > > >>>>> 2. Is their a timeline on the redundant messages? What
> I
> > > mean
> > > > > by
> > > > > > > > this is
> > > > > > > > >>>>> that is there a time limit after which a message with
> > > similar
> > > > > > > > content is
> > > > > > > > >>>>> allowed. From what I gather there was no such thing
> > > > mentioned.
> > > > > > This
> > > > > > > > >>> would
> > > > > > > > >>>>> mean storing all the messages. Depending on the
> > > requirements
> > > > > this
> > > > > > > > may or
> > > > > > > > >>>>> may not be the best solution. It might be desirable
> that
> > no
> > > > > > > > duplicates
> > > > > > > > >>> are
> > > > > > > > >>>>> needed within a certain time window (sliding). This
> > allows
> > > > > > ignoring
> > > > > > > > of
> > > > > > > > >>>>> duplicate messages that were generated very close to
> each
> > > > other
> > > > > > (or
> > > > > > > > in
> > > > > > > > >>> the
> > > > > > > > >>>>> window indicated). Depending on this requirement
> > > > implementation
> > > > > > may
> > > > > > > > >>> become
> > > > > > > > >>>>> a little bit more involved.
> > > > > > > > >>>>>
> > > > > > > > >>>>> For now, these are the only questions. I have ideas
> that
> > > need
> > > > > > > review
> > > > > > > > >>> about
> > > > > > > > >>>>> possible implementations but I will mention them once
> the
> > > > > > > > specifications
> > > > > > > > >>>>> are clear to me. As an end question, I would at some
> > point
> > > > like
> > > > > > to
> > > > > > > > post
> > > > > > > > >>>>> design ideas to this problem privately to get it
> reviewed
> > > by
> > > > > the
> > > > > > > > >>>>> development community but not make it publicly
> available
> > so
> > > > > that
> > > > > > it
> > > > > > > > >>> cannot
> > > > > > > > >>>>> be plagiarised. What platform/method can I use to do
> > that?
> > > Or
> > > > > is
> > > > > > > > >>> submitting
> > > > > > > > >>>>> a draft to the Google platform the only possible way to
> > > > > > accomplish
> > > > > > > > this?
> > > > > > > > >>>>>
> > > > > > > > >>>>> Thanks a lot for reading this through and looking
> forward
> > > to
> > > > > your
> > > > > > > > >>> inputs.
> > > > > > > > >>>>>
> > > > > > > > >>>>> Regards,
> > > > > > > > >>>>> Sohaib Iftikhar
> > > > > > > > >>>>>

Re: 答复: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Posted by yukon <yu...@apache.org>.
```
Personally, I find RAFT to be much simpler to implement. However, I do not
expect to reinvent the wheel here.
```

That's absolutely right: there is no need to reinvent the wheel. There are
many existing implementations of Raft: https://raft.github.io/

```
I don't think using key store to persist all the messages is a good idea.
```

Yes, storing an ID is enough.
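
As a rough illustration of storing only an ID, and only for a limited time
window as discussed earlier in this thread, here is a minimal Java sketch. The
class name WindowedIdDeduplicator and its methods are hypothetical and are not
part of RocketMQ. The sketch is single-node and in-memory, so persistence and
replication are left out, and it assumes the timestamps passed in never go
backwards.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a RocketMQ class): remembers message IDs for a fixed time window. */
final class WindowedIdDeduplicator {
    private final long windowMillis;
    // Insertion-ordered map: ID -> time first seen. The oldest entry is always first,
    // because an ID is only inserted when it is not already present.
    private final LinkedHashMap<String, Long> seen = new LinkedHashMap<>();

    WindowedIdDeduplicator(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the ID has not been seen within the window, i.e. the message should be accepted. */
    synchronized boolean firstSeen(String id, long nowMillis) {
        evictExpired(nowMillis);
        if (seen.containsKey(id)) {
            return false; // duplicate inside the window
        }
        seen.put(id, nowMillis);
        return true;
    }

    /** Drops entries whose age has reached the window length; stops at the first live entry. */
    private void evictExpired(long nowMillis) {
        Iterator<Map.Entry<String, Long>> it = seen.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() < windowMillis) {
                break;
            }
            it.remove();
        }
    }

    synchronized int size() {
        return seen.size();
    }
}
```

A broker-side check could call firstSeen(messageId, System.currentTimeMillis())
before storing a message and drop the message when it returns false. Memory use
is bounded by the number of distinct IDs seen in one window, not by the total
number of messages.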


On Thu, Mar 8, 2018 at 3:32 PM, Sohaib Iftikhar <so...@gmail.com>
wrote:

> Hi Dexin,
>
> Thank you for your suggestions. I will try to answer as much as I can and
> leave the rest to the RocketMQ team.
>
> 1. The idea with incremental Ids is actually quite good. But @Yukon
> mentioned that duplication can also be controlled by an application
> (special KV Property) in which case different producers may produce the
> same message that needs to be deduplicated on the broker.
> SessionId+IncrementalId won't work in this scenario I believe. But we can
> actually switch to more efficient storage using the idea you described when
> the user is not specifying these special keys.
> Also I proposed storing of keys for only a fixed time interval. For all
> practical purposes this would still remain constant time. [Log base 2 of
> 10^10 is still just 33 :) ]. It does add the extra cost of communication
> but this would be the case in both scenarios.
> 2. As for consensus, the ideas I presented were pretty abstract so I
> mentioned a couple of algorithms that could potentially be used.
> Personally, I find RAFT to be much simpler to implement. However, I do not
> expect to reinvent the wheel here. I strongly believe that in this case, we
> can build upon some tested existing solution.
>
>
> Regards,
> Sohaib
>
> On Thu, Mar 8, 2018 at 1:31 AM, 李 德鑫 <de...@outlook.com> wrote:
>
> > Hi Sohaib,
> >
> >
> > I'm a student applying for GSoC too, and I've read all of your discussion
> > on the mailing list.
> >
> > I have some questions about your design, and some of them may need to be
> > answered by the RocketMQ team, so I am sending them here for discussion.
> >
> > I don't think using a key store to persist all the messages is a good idea.
> > An MQ is built on O(1) data structures, so the key store would hurt
> > performance.
> >
> > I think we can learn from the TCP protocol.
> >
> > In producer-broker communication, we can assign an incremental ID to every
> > message sent in the same session, and the producer should persist the
> > session ID on disk. The broker then only needs to maintain a map from
> > session ID to expected message ID (this is how Kafka does it), since there
> > are far more messages than producers. However, a K/V store is still needed,
> > so we have to ask the RocketMQ team how many producers are typically
> > connected at the same time in practice.
> >
> > The same idea applies to consumer-broker communication.
> >
> >
> > About the consensus algorithm, I think RocketMQ should already have an
> > implementation. I don't know what it is, but maybe you can reuse it. If you
> > do have to implement one, in my opinion there is no need to implement both
> > Paxos and Raft, since they solve the same kind of problem.
> >
> >
> >
> > Regards,
> >
> > Dexin
> >
> >
> > ________________________________
> > From: Sohaib Iftikhar <so...@gmail.com>
> > Sent: March 7, 2018 18:15:51
> > To: dev@rocketmq.apache.org
> > Subject: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery
> > mechanism
> >
> > Hi Yukon,
> >
> > Thanks for your reply. Yes, it would be nice to concretely define the scope
> > of this project as the doc is a bit ambitious for just a summer. Should you
> > (or anyone else) have questions/suggestions/clarifications I'd be glad to
> > discuss more details.
> >
> > Thanks,
> > Sohaib
> >
> > On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > A Google doc is better for discussion. Your design is great; now we can
> > > discuss more details based on it.
> > >
> > > Any advice is welcome from RocketMQ community.
> > >
> > > Appreciate your efforts.
> > >
> > > Regards,
> > > yukon
> > >
> > > On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <so...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > @Yukon Thank you for your reply. This clears some doubts.
> > > >
> > > > Sorry for the delay as I was somewhat occupied with another project. I
> > > > have created an initial design doc. Since email is a bit cumbersome for
> > > > feedback, I wrote this document in two formats:
> > > >
> > > > 1. In the form of a Google document:
> > > > https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24
> > > > The document is open for comments to all users without signing in. I
> > > > would appreciate it if you put your name before the comment so I can
> > > > identify who to follow up the discussion with.
> > > >
> > > > 2. As a markdown file on GitHub:
> > > > https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md
> > > > The comments for this can be made on the commit:
> > > > https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
> > > >
> > > > After I have received a certain amount of feedback I will try to
> > > > incorporate it and put in a subsequent version for review. Please tell
> > > > me which method suits you better (gdoc or GitHub) for review and we can
> > > > continue with that for the subsequent versions.
> > > >
> > > > Lastly, the document is a couple of pages so I appreciate your patience
> > > > and your help.
> > > > Looking forward to your opinions.
> > > >
> > > > Thanks,
> > > > Sohaib
> > > >
> > > > On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
> > > >
> > > > > Hi Sohaib,
> > > > >
> > > > > Sorry for the late reply, we could move this project forward now ~
> > > > >
> > > > > ```
> > > > > I would at some point like to post
> > > > > design ideas to this problem privately to get it reviewed by the
> > > > > development community but not make it publicly available so that it
> > > > > cannot be plagiarised.
> > > > > ```
> > > > >
> > > > > You can send your design ideas to me directly or to our PMC list
> > > > > (private@rocketmq.apache.org) if you want to keep your ideas private.
> > > > > But please don't break away from the community.
> > > > >
> > > > > I hope you have already understood the goal of this project. RocketMQ
> > > > > currently supports at-least-once delivery, so an obvious way to
> > > > > achieve exactly-once is to remove duplicated messages.
> > > > >
> > > > > Returning to your original questions:
> > > > >
> > > > > 1. What defines a redundant message?
> > > > >
> > > > > A message ID is generated when a new message is created, so this ID
> > > > > can be used to identify a message. Also, the user could specify a
> > > > > unique business-related property to identify a message.
> > > > >
> > > > > Redundant messages occur when the network is broken or reconnected,
> > > > > when a rebalance[1] is triggered, etc.
> > > > >
> > > > >
> > > > > 2. Is there a timeline on the redundant messages?
> > > > >
> > > > > Yes, keeping all messages non-redundant is expensive, so let's
> > > > > consider this question within a certain time window ~
> > > > >
> > > > > Looking forward to your design.
> > > > >
> > > > > [1].
> > > > > https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
> > > > >
> > > > >
> > > > > Regards,
> > > > > yukon
> > > > >
> > > > >
> > > > > On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <
> > sohaib1692@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > @Zhanhui Thanks for the response. This is not a campaign; it's just
> > > > > > part of GSoC (https://summerofcode.withgoogle.com/). And community
> > > > > > help is gladly welcomed. In fact, it is recommended :)
> > > > > >
> > > > > > @KaiYuan Thanks for your suggestions. I will come up with a flow
> > > > > > chart for the proposed solution this weekend.
> > > > > >
> > > > > > Thanks,
> > > > > > Sohaib
> > > > > >
> > > > > >
> > > > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <li...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > > Hi Sohaib,
> > > > > > >
> > > > > > > I have been sort of busy these days. Sorry to reply to you so late!
> > > > > > >
> > > > > > > I am not sure what "deadline" you are referring to. If this is part
> > > > > > > of a campaign, I have to admit I am not aware of the regulations or
> > > > > > > what kind of help I should offer to maintain fairness, considering
> > > > > > > other similar issues that may arise.
> > > > > > >
> > > > > > > Regards!
> > > > > > >
> > > > > > > Zhanhui Li
> > > > > > >
> > > > > > >
> > > > > > > > On Mar 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hi guys,
> > > > > > > >
> > > > > > > > Would be nice to have some feedback on this as the deadline is not
> > > > > > > > too far :)
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Sohaib
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Sohaib Iftikhar
> > > > > > > >
> > > > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > > > Thank you for the pointers to the code. This was super helpful.
> > > > > > > > The multiple keys could probably be serialized better than by
> > > > > > > > separating them with a space, but that is already legacy I suppose.
> > > > > > > >
> > > > > > > > Firstly, filters like Bloom or cuckoo filters are probabilistic
> > > > > > > > (they can return false positives). They can make things faster but
> > > > > > > > definitely cannot be used as the only solution. Hence, in the end,
> > > > > > > > we will still need a persistent keystore/distributed set. My plan
> > > > > > > > was to make this keystore distributed (Raft guarantees etc.). The
> > > > > > > > keystore can also hold a persistent filter on its end. If a broker
> > > > > > > > crashes, it can renew/refresh its filter from the keystore, which
> > > > > > > > eliminates the crash problems that you mention. The problem here
> > > > > > > > could be maintaining filter performance when entries are removed
> > > > > > > > from the keystore (e.g. sliding windows as mentioned in my previous
> > > > > > > > mail). Periodically refreshing the filters can help solve this, but
> > > > > > > > I am open to suggestions on how to make this better.
> > > > > > > >
> > > > > > > > I think implementing a distributed set on the client cluster has
> > > > > > > > its caveats. As I understand RocketMQ, we do not have control over
> > > > > > > > the disk space/memory on the client end, so we probably only have a
> > > > > > > > constant amount. A distributed set on the client would also need to
> > > > > > > > be persistent, e.g. for when a client restarts or recovers. This
> > > > > > > > basically means we need a keystore on the client instead of the
> > > > > > > > broker cluster, which probably puts too much responsibility on the
> > > > > > > > client cluster. A different approach would be to ensure that the
> > > > > > > > offsets are always in sync with the broker. Since the broker only
> > > > > > > > serves unique messages (based on the proposed solution on the
> > > > > > > > producer/broker end), all we need to ensure is that a client does
> > > > > > > > not consume messages with the same offset twice.
> > > > > > > >
> > > > > > > > Please suggest improvements if this does not look like the correct
> > > > > > > > approach. It would also be great if someone could come up with a
> > > > > > > > completely different approach so that we can weigh up the pros and
> > > > > > > > cons.
> > > > > > > >
> > > > > > > > Thanks for reading this through and looking forward to your
> > > > > > > > opinions.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Sohaib
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Sohaib Iftikhar
> > > > > > > >
> > > > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <lizhanhui@gmail.com> wrote:
> > > > > > > > Hi Sohaib,
> > > > > > > >
> > > > > > > > About multiple key support, the following should clarify your
> > > > > > > > doubt: the org.apache.rocketmq.common.message.Message class has
> > > > > > > > overloaded setKeys methods, allowing you to set multiple keys via
> > > > > > > > a string (separated by spaces; sorry, we have not yet unified all
> > > > > > > > separators, hopefully this does not confuse you) or a collection.
> > > > > > > >
> > > > > > > >
> > > > > > > > When the broker builds the index for a message with multiple keys,
> > > > > > > > multiple index entries are inserted into the index file.
> > > > > > > > See org.apache.rocketmq.store.index.IndexService#buildIndex
> > > > > > > >
> > > > > > > >
> > > > > > > > In terms of eliminating message duplication, personally, I wish we
> > > > > > > > had exactly-once semantics covering the whole cluster and the
> > > > > > > > complete send-store-consume process. A rough idea is to route each
> > > > > > > > message to a broker by its unique key according to a rule; the
> > > > > > > > serving broker then ensures uniqueness of the message by key (as
> > > > > > > > you said, Bloom filter/cuckoo filter, etc.). Things might look
> > > > > > > > simple, but issues arise in scenarios where the cluster is
> > > > > > > > experiencing membership changes. For example, what if a broker
> > > > > > > > crashes? We might need to propagate the Bloom filter bitset
> > > > > > > > synchronously to other brokers serving the same topics. What if a
> > > > > > > > new broker joins the cluster and starts to serve? I do not mean
> > > > > > > > this is too complex to implement. Instead, this is a pretty
> > > > > > > > interesting topic and a fancy feature to have. Alternatively, we
> > > > > > > > might defer eliminating duplicates to the consumption phase using
> > > > > > > > some kind of distributed set. For sure, the idea I am proposing
> > > > > > > > faces the same challenges, including membership changes.
> > > > > > > >
> > > > > > > > Folks on the dev board, any insights on this issue?
> > > > > > > >
> > > > > > > > Zhanhui Li
> > > > > > > >
> > > > > > > >
> > > > > > > >> On Feb 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> Hi Zhanhui,
> > > > > > > >>
> > > > > > > >> I have a doubt about these multiple keys. If I am wrong in
> any
> > > of
> > > > > the
> > > > > > > >> assumptions I make please point it out.
> > > > > > > >>
> > > > > > > >> If there is support for multiple keys I cannot see this in
> the
> > > > code.
> > > > > > The
> > > > > > > >> class Message only stores a single key in the property map
> > > against
> > > > > the
> > > > > > > >> property name "KEYS". Is this also done in the same ways as
> > > tags?
> > > > > That
> > > > > > > is
> > > > > > > >> different keys are separated with ' || '? So basically as a
> > user
> > > > of
> > > > > > the
> > > > > > > >> producer API it is the user's responsibility to ensure that
> he
> > > > > > separates
> > > > > > > >> the different keys with the correct separator. I can see an
> > > > obvious
> > > > > > > problem
> > > > > > > >> here. What if the key contains this special character ' ||
> '?
> > > But
> > > > > > maybe
> > > > > > > >> this event is rare and hence this is not important. Could
> you
> > > > point
> > > > > me
> > > > > > > to
> > > > > > > >> some source/doc that explains this part? I was looking at
> the
> > > > index
> > > > > > > section
> > > > > > > >> rocketmq-store but I have not been able to understand the
> > > indexing
> > > > > > > process
> > > > > > > >> completely for now. I will keep reading the source to get a
> > > better
> > > > > > idea.
> > > > > > > >>
> > > > > > > >> Moving on to the implementational details. Here is a broad
> > idea
> > > of
> > > > > one
> > > > > > > >> possible way to approach it.
> > > > > > > >>
> > > > > > > >> The attempt is to remove duplicate messages. In this issue,
> I
> > > > would
> > > > > > > like to
> > > > > > > >> aim at eliminating duplicate messages at the producer/broker
> > > end.
> > > > > For
> > > > > > > now,
> > > > > > > >> we do not concern ourselves with the duplicate messages
> > > happening
> > > > > due
> > > > > > to
> > > > > > > >> unwritten consumer offsets as these two issues have
> different
> > > > > > solutions.
> > > > > > > >> One way to solve this problem at the producer/broker end
> could
> > > be
> > > > to
> > > > > > > have a
> > > > > > > >> distributed key store that stores the messages. We can make
> it
> > > > > > > configurable
> > > > > > > >> such that this distributed store stores all messages or
> works
> > > as a
> > > > > > > sliding
> > > > > > > >> window keeping only the messages from the last X seconds
> > > specified
> > > > > by
> > > > > > > the
> > > > > > > >> user. We can have a layer on top to check set membership
> such
> > > as a
> > > > > > bloom
> > > > > > > >> filter or a cuckoo filter (
> > > > > > > >> https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf <
> > > > > > > https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf>) to
> > help
> > > > > > > >> performance. Every message being pushed in by a producer are
> > > > checked
> > > > > > in
> > > > > > > >> first with the filter and in case of a positive result with
> > this
> > > > key
> > > > > > > store.
> > > > > > > >> If the message is found then it is discarded. This helps
> > remove
> > > > > > > duplicates
> > > > > > > >> completely from a producer perspective. The core of this
> idea
> > is
> > > > the
> > > > > > > >> distributed key store which would be completely separate
> from
> > > the
> > > > > > > current
> > > > > > > >> message storage. Since the concept of a distributed key
> store
> > > or a
> > > > > > > >> key/value store is not novel there are two ways to this.
> > > > > > > >> 1. Implement it ourselves. This would be high effort but no
> > > > external
> > > > > > > >> dependencies.
> > > > > > > >> 2. Use a key-value store such as Redis (which already has
> > > timeouts
> > > > > and
> > > > > > > >> persistence but a large memory footprint) or some other
> > > disk-based
> > > > > > > storage
> > > > > > > >> for set membership. This would include an external
> dependency
> > > but
> > > > > > > >> development time will reduce significantly for such a
> > solution.
> > > > > > > >> I am inclined towards implementing it by myself as this
> would
> > > > avoid
> > > > > > > >> dependencies on other products especially since RocketMQ is
> > > > > currently
> > > > > > a
> > > > > > > >> self-reliant system. In addition, my past experience with
> > > building
> > > > > > such
> > > > > > > a
> > > > > > > >> store should also come in handy.
> > > > > > > >>
> > > > > > > >> I would like to know the opinions of the development
> community
> > > on
> > > > > this
> > > > > > > >> approach and to suggest improvements on it. Looking forward
> to
> > > > your
> > > > > > > >> responses to this.
> > > > > > > >>
> > > > > > > >> ====<question unrelated to issue>=====
> > > > > > > >> To increase my familiarity with the code base and to help
> > prove
> > > > > that I
> > > > > > > am
> > > > > > > >> familiar with the tools and technologies in place it would
> be
> > > > great
> > > > > > if I
> > > > > > > >> could be pointed to some low effort issues that I could help
> > out
> > > > > with.
> > > > > > > In
> > > > > > > >> case there are no 'newbie' issues available I could help
> > improve
> > > > the
> > > > > > > >> comments inside the codebase. I noticed some source files
> with
> > > no
> > > > > > > >> explanations which can be documented via comments to help
> > > onboard
> > > > a
> > > > > > new
> > > > > > > >> contributor faster.
> > > > > > > >> ====</question unrelated to issue>=====
> > > > > > > >>
> > > > > > > >> Thanks a lot for reading this through and looking forward to
> > > your
> > > > > > > opinions.
> > > > > > > >>
> > > > > > > >> Regards,
> > > > > > > >> Sohaib
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <
> > > lizhanhui@gmail.com
> > > > > > > <ma...@gmail.com>> wrote:
> > > > > > > >>
> > > > > > > >>> Hi Sohaib,
> > > > > > > >>>
> > > > > > > >>> Happy to know you are interested in RocketMQ.
> > > > > > > >>>
> > > > > > > >>> First, let me answer questions you raised.
> > > > > > > >>>
> > > > > > > >>> — can there be multiple tags?
> > > > > > > >>> No. At present, the storage engine allows single tag only.
> > > > > > > Subscriptions
> > > > > > > >>> are allowed to use combination of tags. The current model
> > > should
> > > > > meet
> > > > > > > your
> > > > > > > >>> business development. If not, please let us know.
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> — key (Similar question to above.)
> > > > > > > >>> RocketMQ builds index using message keys. A single message
> > may
> > > > have
> > > > > > > >>> multiple keys.
> > > > > > > >>>
> > > > > > > >>> — About redundant message
> > > > > > > >>> From my understanding, you are trying to eliminate
> duplicate
> > > > > > messages.
> > > > > > > >>> True there are various reasons which may cause message
> > > > duplication,
> > > > > > > ranging
> > > > > > > >>> from message delivery and consumption. Discussion on this
> > topic
> > > > is
> > > > > > > warmly
> > > > > > > >>> welcome.  Had you had any idea to contribute on this issue,
> > the
> > > > > > > developer
> > > > > > > >>> board is happy to discuss.
> > > > > > > >>>
> > > > > > > >>> Zhanhui Li
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>> On Feb 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>> My earlier email message seems to have gotten lost. So I
> > will
> > > > try
> > > > > > > again.
> > > > > > > >>>> Please see the original message for the discussion.
> > > > > > > >>>>
> > > > > > > >>>> Regards,
> > > > > > > >>>> Sohaib Iftikhar
> > > > > > > >>>>
> > > > > > > >>>> -- Man is still the most extraordinary computer of all.--
> > > > > > > >>>>
> > > > > > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <
> > > > > > > sohaib1692@gmail.com <ma...@gmail.com>>
> > > > > > > >>>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Hi,
> > > > > > > >>>>>
> > > > > > > >>>>> I am interested in working on this issue (
> > > > > > https://issues.apache.org/
> > > > > > > <https://issues.apache.org/>
> > > > > > > >>>>> jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a
> few
> > > > > > questions
> > > > > > > for
> > > > > > > >>>>> the same. I am not sure if this discussion needs to be on
> > the
> > > > > JIRA
> > > > > > > >>> issue or
> > > > > > > >>>>> here. Feel free to correct me if this is the wrong
> > platform.
> > > > Also
> > > > > > > while
> > > > > > > >>> I
> > > > > > > >>>>> have worked with distributed pub-sub systems I am still
> > > fairly
> > > > > new
> > > > > > to
> > > > > > > >>>>> Rocket-MQ so maybe my understanding of it is incorrect. I
> > > > > apologise
> > > > > > > if
> > > > > > > >>> that
> > > > > > > >>>>> is the case and would be happy to stand corrected.
> > > > > > > >>>>>
> > > > > > > >>>>> Following are my questions:
> > > > > > > >>>>> 1. What defines a redundant message?
> > > > > > > >>>>>   The constructor that I see for a message is as follows:
> > > > > > > >>>>>   Message(String topic, String tags, String keys, int
> flag,
> > > > > byte[]
> > > > > > > >>> body,
> > > > > > > >>>>> boolean waitStoreMsgOK)
> > > > > > > >>>>>   Possible candidates to me are topic, tags (can there be
> > > > > multiple
> > > > > > > >>> tags?
> > > > > > > >>>>> I could not find an example for this. If yes how are they
> > > > > > > separated?),
> > > > > > > >>> keys
> > > > > > > >>>>> (Similar question to above.) and of course the body. Is
> > there
> > > > > > > something
> > > > > > > >>>>> that I have missed in this? Is there something that we do
> > not
> > > > > need
> > > > > > to
> > > > > > > >>>>> consider?
> > > > > > > > >>>>> 2. Is there a timeline on the redundant messages? What I
> > mean
> > > > by
> > > > > > > this is
> > > > > > > >>>>> that is there a time limit after which a message with
> > similar
> > > > > > > content is
> > > > > > > >>>>> allowed. From what I gather there was no such thing
> > > mentioned.
> > > > > This
> > > > > > > >>> would
> > > > > > > >>>>> mean storing all the messages. Depending on the
> > requirements
> > > > this
> > > > > > > may or
> > > > > > > >>>>> may not be the best solution. It might be desirable that
> no
> > > > > > > duplicates
> > > > > > > >>> are
> > > > > > > >>>>> needed within a certain time window (sliding). This
> allows
> > > > > ignoring
> > > > > > > of
> > > > > > > >>>>> duplicate messages that were generated very close to each
> > > other
> > > > > (or
> > > > > > > in
> > > > > > > >>> the
> > > > > > > >>>>> window indicated). Depending on this requirement
> > > implementation
> > > > > may
> > > > > > > >>> become
> > > > > > > >>>>> a little bit more involved.
> > > > > > > >>>>>
> > > > > > > >>>>> For now, these are the only questions. I have ideas that
> > need
> > > > > > review
> > > > > > > >>> about
> > > > > > > >>>>> possible implementations but I will mention them once the
> > > > > > > specifications
> > > > > > > >>>>> are clear to me. As an end question, I would at some
> point
> > > like
> > > > > to
> > > > > > > post
> > > > > > > >>>>> design ideas to this problem privately to get it reviewed
> > by
> > > > the
> > > > > > > >>>>> development community but not make it publicly available
> so
> > > > that
> > > > > it
> > > > > > > >>> cannot
> > > > > > > >>>>> be plagiarised. What platform/method can I use to do
> that?
> > Or
> > > > is
> > > > > > > >>> submitting
> > > > > > > >>>>> a draft to the Google platform the only possible way to
> > > > > accomplish
> > > > > > > this?
> > > > > > > >>>>>
> > > > > > > >>>>> Thanks a lot for reading this through and looking forward
> > to
> > > > your
> > > > > > > >>> inputs.
> > > > > > > >>>>>
> > > > > > > >>>>> Regards,
> > > > > > > >>>>> Sohaib Iftikhar
> > > > > > > >>>>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: 答复: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Posted by Sohaib Iftikhar <so...@gmail.com>.
Hi Dexin,

Thank you for your suggestions. I will try to answer as much as I can and
leave the rest to the RocketMQ team.

1. The idea with incremental IDs is actually quite good. But @Yukon
mentioned that duplication can also be controlled by the application
(through a special KV property), in which case different producers may
produce the same message, which then needs to be deduplicated on the broker.
I believe SessionId+IncrementalId won't work in this scenario. But we can
switch to the more efficient storage you described when the user does not
specify these special keys.
Also, I proposed storing keys for only a fixed time interval. For all
practical purposes lookups would still be effectively constant time [log
base 2 of 10^10 is still only about 33 :) ]. It does add the extra cost of
communication, but this would be the case in both scenarios.
2. As for consensus, the ideas I presented were fairly abstract, so I
mentioned a couple of algorithms that could potentially be used.
Personally, I find Raft much simpler to implement. However, I do not
intend to reinvent the wheel here. I strongly believe that in this case we
can build upon an existing, tested solution.
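
To make point 1 a bit more concrete, here is a minimal sketch in Java of how
the two paths could sit side by side on the broker. All names (DedupSketch,
accept, appKey) are made up for illustration and are not RocketMQ APIs.
In-memory maps stand in for the persistent key store, and the sketch assumes
timestamps arrive in non-decreasing order.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical broker-side duplicate check; illustration only, not RocketMQ code. */
class DedupSketch {
    // Path A: next expected sequence number per producer session.
    private final Map<String, Long> nextSeqBySession = new HashMap<>();
    // Path B: application keys seen recently (key -> arrival time), in arrival order.
    private final LinkedHashMap<String, Long> recentKeys = new LinkedHashMap<>();
    private final long windowMillis;

    DedupSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    /** Returns true if the message should be stored, false if it is a duplicate. */
    boolean accept(String sessionId, long seq, String appKey, long nowMillis) {
        if (appKey != null) {
            // The application supplied its own key: different producers may
            // send the same key, so session + sequence cannot be used here.
            return acceptByKey(appKey, nowMillis);
        }
        long expected = nextSeqBySession.getOrDefault(sessionId, 0L);
        if (seq < expected) {
            return false; // retransmission of an already stored message
        }
        // Simplification: gaps in the sequence are accepted, not rejected.
        nextSeqBySession.put(sessionId, seq + 1);
        return true;
    }

    private boolean acceptByKey(String appKey, long nowMillis) {
        // Drop keys that have fallen out of the time window (oldest first).
        Iterator<Map.Entry<String, Long>> it = recentKeys.entrySet().iterator();
        while (it.hasNext()) {
            if (nowMillis - it.next().getValue() >= windowMillis) {
                it.remove();
            } else {
                break;
            }
        }
        if (recentKeys.containsKey(appKey)) {
            return false; // same key seen inside the window
        }
        recentKeys.put(appKey, nowMillis);
        return true;
    }
}
```

With this shape the broker keeps one counter per session in the common case
and only pays for the windowed key set when the user sets a special key.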


Regards,
Sohaib

On Thu, Mar 8, 2018 at 1:31 AM, 李 德鑫 <de...@outlook.com> wrote:

> Hi Sohaib,
>
>
> I'm a student applying for GSOC too, and I've read all of your discussion
> on the mailing list.
>
> I have some questions about your design, and some of them may need to be
> answered by the RocketMQ team, so I am sending them here for discussion.
>
> I don't think using a key store to persist all the messages is a good idea.
> An MQ is based on O(1) data structures, so the key store would harm
> performance.
>
> I think we can learn from the TCP protocol.
>
> In producer-broker communication, we can assign an incrementing id to every
> message sent in the same session, and the producer should persist the session
> id on disk. The broker then only needs to maintain a map from session id to
> expected message id (this is how Kafka does it), since there are far more
> messages than producers. However, a K/V store is still needed, so we have to
> ask the RocketMQ team how many producers are typically active at the same
> time in practice.
>
> The same idea applies to consumer-broker communication.
>
>
> About the consensus algorithm, I think RocketMQ should already have an
> implementation. I don't know what it is, but maybe you can reuse it. Or, if
> you do have to implement one, in my opinion there's no need to implement
> both Paxos and Raft, since they solve the same kind of problem.
>
>
>
> Regards,
>
> Dexin
>
>
> ________________________________
> From: Sohaib Iftikhar <so...@gmail.com>
> Sent: March 7, 2018 18:15:51
> To: dev@rocketmq.apache.org
> Subject: Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery
> mechanism
>
> Hi Yukon,
>
> Thanks for your reply. Yes, it would be nice to concretely define the scope
> of this project as the doc is a bit ambitious for just a summer. Should you
> (or anyone else) have questions/suggestions/clarifications I'd be glad to
> discuss more details.
>
> Thanks,
> Sohaib
>
> On Wed, Mar 7, 2018 at 8:58 AM, yukon <yu...@apache.org> wrote:
>
> > Hi,
> >
> > Google doc is better for discussion. Your design is great; now we can
> > discuss more details based on it.
> >
> > Any advice is welcome from RocketMQ community.
> >
> > Appreciate your efforts.
> >
> > Regards,
> > yukon
> >
> > On Wed, Mar 7, 2018 at 5:15 AM, Sohaib Iftikhar <so...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > @Yukon Thank you for your reply. This clears some doubts.
> > >
> > > Sorry for the delay as I was somewhat occupied with another project. I have
> > > created an initial design doc. Email is a bit cumbersome for feedback, so I
> > > wrote this document in two formats:
> > >
> > > 1. In the form of a Google document:
> > > https://docs.google.com/document/d/1KSpXGNDH0HF5E27lfKJxJnjIjPtlP1Q-M6rj3yZde24.
> > > The document is open for comments to all users without signing in. I would
> > > appreciate it if you put your name before the comment so I can identify who
> > > to follow up the discussion with.
> > >
> > > 2. As a Markdown file on GitHub:
> > > https://github.com/sohaibiftikhar/rocketmq/blob/gsoc_design/gsoc_design.md.
> > > The comments for this can be made on the commit:
> > > https://github.com/sohaibiftikhar/rocketmq/commit/dfd55fc69f430fc024217a3b20dde31717334e62
> > >
> > > After I have received a certain amount of feedback I will try to
> > > incorporate it and put up a subsequent version for review. Please tell me
> > > which method suits you better (gdoc or github) for review and we can
> > > continue with that for the subsequent versions.
> > >
> > > Lastly, the document is a couple of pages long, so I appreciate your
> > > patience and your help.
> > > Looking forward to your opinions.
> > >
> > > Thanks,
> > > Sohaib
> > >
> > > On Mon, Mar 5, 2018 at 1:01 PM, yukon <yu...@apache.org> wrote:
> > >
> > > > Hi Sohaib,
> > > >
> > > > Sorry for the late reply, we could move this project forward now ~
> > > >
> > > > ```
> > > > I would at some point like to post
> > > > design ideas to this problem privately to get it reviewed by the
> > > > development community but not make it publicly available so that it
> > > cannot
> > > > be plagiarised.
> > > > ```
> > > >
> > > > You can send your design ideas to me directly or to our PMC list
> > > > (private@rocketmq.apache.org) if you want to keep your ideas private.
> > > > But please don't break away from the community.
> > > >
> > > > I hope you have already understood the goal of this project. RocketMQ
> > > > currently supports at-least-once delivery, so an obvious way to achieve
> > > > exactly-once delivery is to remove duplicated messages.
> > > >
> > > > Return to your original questions:
> > > >
> > > > 1. What defines a redundant message?
> > > >
> > > > A message id is generated when a new message is created, so this id can
> > > > be used to identify a message. Also, the user could specify a unique
> > > > business-related property to identify a message.
> > > >
> > > > Redundant messages occur when the network is broken or reconnected,
> > > > when a rebalance[1] is triggered, etc.
> > > >
> > > >
> > > > 2. Is there a timeline on the redundant messages?
> > > >
> > > > Yes, keeping all messages non-redundant is expensive, so let's consider
> > > > this question within a certain time window ~
> > > >
> > > > Looking forward to your design.
> > > >
> > > > [1].
> > > > https://github.com/apache/rocketmq/blob/master/client/src/main/java/org/apache/rocketmq/client/impl/consumer/RebalanceService.java
> > > >
> > > >
> > > > Regards,
> > > > yukon
> > > >
> > > >
> > > > On Fri, Mar 2, 2018 at 9:31 PM, Sohaib Iftikhar <
> sohaib1692@gmail.com>
> > > > wrote:
> > > >
> > > > > @Zhanhui Thanks for the response. This is not a campaign; it's just part
> > > > > of GSoC (https://summerofcode.withgoogle.com/). And community help is
> > > > > gladly welcomed. In fact, it is recommended :)
> > > > >
> > > > > @KaiYuan Thanks for your suggestions. I will come up with a flow chart
> > > > > for the proposed solution this weekend.
> > > > >
> > > > > Thanks,
> > > > > Sohaib
> > > > >
> > > > >
> > > > > On Fri, Mar 2, 2018 at 3:41 AM, Zhanhui Li <li...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Sohaib,
> > > > > >
> > > > > > I have been sort of busy these days. Sorry for replying so late!
> > > > > >
> > > > > > Not sure what “deadline” you are referring to. If this is part of a
> > > > > > campaign, I have to admit I am not aware of the regulations and what
> > > > > > kind of help I should offer to maintain fairness, considering other
> > > > > > similar issues that may arise.
> > > > > >
> > > > > > Regards!
> > > > > >
> > > > > > Zhanhui Li
> > > > > >
> > > > > >
> > > > > > > > On March 1, 2018, at 3:43 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > > Would be nice to have some feedback on this as the deadline is not
> > > > > > > > too far :)
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Sohaib
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sohaib Iftikhar
> > > > > > >
> > > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <
> > > > > sohaib1692@gmail.com
> > > > > > <ma...@gmail.com>> wrote:
> > > > > > > > Thank you for the pointers to the code. This was super helpful. The
> > > > > > > > multiple keys can probably be serialized better than separating them
> > > > > > > > with a space but that is already legacy I suppose.
> > > > > > >
> > > > > > > > Firstly, filters like bloom or cuckoo are heuristic. They can help
> > > > > > > > make things faster but definitely cannot be used as the only
> > > > > > > > solution. Hence, in the end, we will still need a persistent
> > > > > > > > keystore/distributed set. My plan was to have this keystore as
> > > > > > > > distributed (raft guarantee etc.). The keystore can also hold a
> > > > > > > > persistent filter on its end. If a broker collapses it can
> > > > > > > > renew/refresh its filter from the keystore, hence eliminating the
> > > > > > > > problems about crashes that you mention. The problem here could be
> > > > > > > > in maintaining performance for filters in case of removals from the
> > > > > > > > keystore (e.g. sliding windows as mentioned in my previous mail).
> > > > > > > > Periodic refreshing of filters can help solve this but I am open to
> > > > > > > > suggestions on how to make this better.
> > > > > > >
> > > > > > > > I think implementing a distributed set on the client cluster has its
> > > > > > > > caveats. The way I understand RocketMQ is that we do not have control
> > > > > > > > over the diskspace/memory on the client end. So we probably only have
> > > > > > > > a constant amount. A distributed set on the client would also need to
> > > > > > > > be persistent, e.g. if a client restarts/recovers etc. This basically
> > > > > > > > means we need a keystore on the client instead of the broker cluster.
> > > > > > > > This probably puts too much responsibility on the client cluster. A
> > > > > > > > different approach would be to ensure that the offsets are always in
> > > > > > > > sync with the broker. Since the broker only serves unique messages
> > > > > > > > (based on the proposed solution on the producer/broker end) all we
> > > > > > > > need to ensure is that a client does not consume messages with the
> > > > > > > > same offset twice.
> > > > > > >
> > > > > > > > Please suggest improvements if this does not look like the correct
> > > > > > > > approach. Also would be great if someone can come up with a
> > > > > > > > completely different approach so that we can weigh up pros and cons.
> > > > > > > >
> > > > > > > > Thanks for reading this through and looking forward to your opinions.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sohaib
> > > > > > >
> > > > > > > Regards,
> > > > > > > Sohaib Iftikhar
> > > > > > >
> > > > > > > -- Man is still the most extraordinary computer of all.--
> > > > > > >
> > > > > > >
> > > > > > > On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <
> lizhanhui@gmail.com
> > > > > > <ma...@gmail.com>> wrote:
> > > > > > > Hi Sohaib,
> > > > > > >
> > > > > > > > About multiple key support, the following should clarify your doubt:
> > > > > > > > the org.apache.rocketmq.common.message.Message class has overloaded
> > > > > > > > setKeys methods, allowing you to set multiple keys via a string
> > > > > > > > (separated by spaces… sorry, we have not yet unified all separators,
> > > > > > > > hoping this does not confuse you) or a collection.
> > > > > > > >
> > > > > > > >
> > > > > > > > When the broker builds the index for a message with multiple keys,
> > > > > > > > multiple index entries are inserted into the indexing file.
> > > > > > > > See org.apache.rocketmq.store.index.IndexService#buildIndex
> > > > > > >
> > > > > > >
> > > > > > > > In terms of eliminating message duplication, personally, I wish we
> > > > > > > > had a feature of exactly-once semantics covering the whole cluster
> > > > > > > > and the complete send-store-consume process. A rough idea is to route
> > > > > > > > each message, according to its unique key, to a broker chosen by a
> > > > > > > > rule; the serving broker then ensures uniqueness of the message
> > > > > > > > according to the key (as you said, bloom-filter/cuckoo-filter, etc.).
> > > > > > > > Things might look simple, but issues reside in scenarios where the
> > > > > > > > cluster is experiencing membership changes. For example, what if a
> > > > > > > > broker crashes? We might need to propagate the bloom-filter bitset
> > > > > > > > synchronously to other brokers serving the same topics. What if a new
> > > > > > > > broker joins the cluster and starts to serve? I do not mean this is
> > > > > > > > too complex to implement. Instead, this is a pretty interesting topic
> > > > > > > > and a fancy feature to have. Alternatively, we might defer
> > > > > > > > eliminating duplicates to the consumption phase using a kind of
> > > > > > > > distributed set. For sure, the idea I am proposing suffers from the
> > > > > > > > same challenges, including membership changes.
> > > > > > >
> > > > > > > Guys of dev board, any insights on this issue?
> > > > > > >
> > > > > > > Zhanhui Li
> > > > > > >
> > > > > > >
> > > > > > > >> On February 26, 2018, at 2:47 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > >>
> > > > > > >> Hi Zhanhui,
> > > > > > >>
> > > > > > >> I have a doubt about these multiple keys. If I am wrong in any
> > of
> > > > the
> > > > > > >> assumptions I make please point it out.
> > > > > > >>
> > > > > > >> If there is support for multiple keys I cannot see this in the
> > > code.
> > > > > The
> > > > > > >> class Message only stores a single key in the property map
> > against
> > > > the
> > > > > > >> property name "KEYS". Is this also done in the same ways as
> > tags?
> > > > That
> > > > > > is
> > > > > > >> different keys are separated with ' || '? So basically as a
> user
> > > of
> > > > > the
> > > > > > >> producer API it is the user's responsibility to ensure that he
> > > > > separates
> > > > > > >> the different keys with the correct separator. I can see an
> > > obvious
> > > > > > problem
> > > > > > >> here. What if the key contains this special character ' || '?
> > But
> > > > > maybe
> > > > > > >> this event is rare and hence this is not important. Could you
> > > point
> > > > me
> > > > > > to
> > > > > > >> some source/doc that explains this part? I was looking at the
> > > index
> > > > > > section
> > > > > > >> rocketmq-store but I have not been able to understand the
> > indexing
> > > > > > process
> > > > > > >> completely for now. I will keep reading the source to get a
> > better
> > > > > idea.
> > > > > > >>
> > > > > > >> Moving on to the implementational details. Here is a broad
> idea
> > of
> > > > one
> > > > > > >> possible way to approach it.
> > > > > > >>
> > > > > > >> The attempt is to remove duplicate messages. In this issue, I
> > > would
> > > > > > like to
> > > > > > >> aim at eliminating duplicate messages at the producer/broker
> > end.
> > > > For
> > > > > > now,
> > > > > > >> we do not concern ourselves with the duplicate messages
> > happening
> > > > due
> > > > > to
> > > > > > >> unwritten consumer offsets as these two issues have different
> > > > > solutions.
> > > > > > >> One way to solve this problem at the producer/broker end could
> > be
> > > to
> > > > > > have a
> > > > > > >> distributed key store that stores the messages. We can make it
> > > > > > configurable
> > > > > > >> such that this distributed store stores all messages or works
> > as a
> > > > > > sliding
> > > > > > >> window keeping only the messages from the last X seconds
> > specified
> > > > by
> > > > > > the
> > > > > > >> user. We can have a layer on top to check set membership such
> > as a
> > > > > bloom
> > > > > > >> filter or a cuckoo filter (
> > > > > > >> https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf <
> > > > > > https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf>) to
> help
> > > > > > >> performance. Every message being pushed in by a producer are
> > > checked
> > > > > in
> > > > > > >> first with the filter and in case of a positive result with
> this
> > > key
> > > > > > store.
> > > > > > >> If the message is found then it is discarded. This helps
> remove
> > > > > > duplicates
> > > > > > >> completely from a producer perspective. The core of this idea
> is
> > > the
> > > > > > >> distributed key store which would be completely separate from
> > the
> > > > > > current
> > > > > > >> message storage. Since the concept of a distributed key store
> > or a
> > > > > > >> key/value store is not novel there are two ways to this.
> > > > > > >> 1. Implement it ourselves. This would be high effort but no
> > > external
> > > > > > >> dependencies.
> > > > > > >> 2. Use a key-value store such as Redis (which already has
> > timeouts
> > > > and
> > > > > > >> persistence but a large memory footprint) or some other
> > disk-based
> > > > > > storage
> > > > > > >> for set membership. This would include an external dependency
> > but
> > > > > > >> development time will reduce significantly for such a
> solution.
> > > > > > >> I am inclined towards implementing it by myself as this would
> > > avoid
> > > > > > >> dependencies on other products especially since RocketMQ is
> > > > currently
> > > > > a
> > > > > > >> self-reliant system. In addition, my past experience with
> > building
> > > > > such
> > > > > > a
> > > > > > >> store should also come in handy.
> > > > > > >>
> > > > > > >> I would like to know the opinions of the development community
> > on
> > > > this
> > > > > > >> approach and to suggest improvements on it. Looking forward to
> > > your
> > > > > > >> responses to this.
> > > > > > >>
> > > > > > >> ====<question unrelated to issue>=====
> > > > > > >> To increase my familiarity with the code base and to help
> prove
> > > > that I
> > > > > > am
> > > > > > >> familiar with the tools and technologies in place it would be
> > > great
> > > > > if I
> > > > > > >> could be pointed to some low effort issues that I could help
> out
> > > > with.
> > > > > > In
> > > > > > >> case there are no 'newbie' issues available I could help
> improve
> > > the
> > > > > > >> comments inside the codebase. I noticed some source files with
> > no
> > > > > > >> explanations which can be documented via comments to help
> > onboard
> > > a
> > > > > new
> > > > > > >> contributor faster.
> > > > > > >> ====</question unrelated to issue>=====
> > > > > > >>
> > > > > > >> Thanks a lot for reading this through and looking forward to
> > your
> > > > > > opinions.
> > > > > > >>
> > > > > > >> Regards,
> > > > > > >> Sohaib
> > > > > > >>
> > > > > > >>
> > > > > > >> On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <
> > lizhanhui@gmail.com
> > > > > > <ma...@gmail.com>> wrote:
> > > > > > >>
> > > > > > >>> Hi Sohaib,
> > > > > > >>>
> > > > > > >>> Happy to know you are interested in RocketMQ.
> > > > > > >>>
> > > > > > >>> First, let me answer questions you raised.
> > > > > > >>>
> > > > > > >>> — can there be multiple tags?
> > > > > > > >>> No. At present, the storage engine allows a single tag only.
> > > > > > > >>> Subscriptions are allowed to use a combination of tags. The current
> > > > > > > >>> model should meet your business needs. If not, please let us know.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> — key (Similar question to above.)
> > > > > > > >>> RocketMQ builds its index using message keys. A single message may
> > > > > > > >>> have multiple keys.
> > > > > > >>>
> > > > > > >>> — About redundant message
> > > > > > > >>> From my understanding, you are trying to eliminate duplicate
> > > > > > > >>> messages. True, there are various causes of message duplication,
> > > > > > > >>> ranging from message delivery to consumption. Discussion on this
> > > > > > > >>> topic is warmly welcome. If you have any ideas to contribute on
> > > > > > > >>> this issue, the developer board is happy to discuss.
> > > > > > >>>
> > > > > > >>> Zhanhui Li
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > > >>>> On February 24, 2018, at 11:17 AM, Sohaib Iftikhar <sohaib1692@gmail.com> wrote:
> > > > > > >>>>
> > > > > > >>>> My earlier email message seems to have gotten lost. So I
> will
> > > try
> > > > > > again.
> > > > > > >>>> Please see the original message for the discussion.
> > > > > > >>>>
> > > > > > >>>> Regards,
> > > > > > >>>> Sohaib Iftikhar
> > > > > > >>>>
> > > > > > >>>> -- Man is still the most extraordinary computer of all.--
> > > > > > >>>>
> > > > > > >>>> On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <
> > > > > > sohaib1692@gmail.com <ma...@gmail.com>>
> > > > > > >>>> wrote:
> > > > > > >>>>
> > > > > > >>>>> Hi,
> > > > > > >>>>>
> > > > > > >>>>> I am interested in working on this issue (
> > > > > https://issues.apache.org/
> > > > > > <https://issues.apache.org/>
> > > > > > >>>>> jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a few
> > > > > questions
> > > > > > for
> > > > > > >>>>> the same. I am not sure if this discussion needs to be on
> the
> > > > JIRA
> > > > > > >>> issue or
> > > > > > >>>>> here. Feel free to correct me if this is the wrong
> platform.
> > > Also
> > > > > > while
> > > > > > >>> I
> > > > > > >>>>> have worked with distributed pub-sub systems I am still
> > fairly
> > > > new
> > > > > to
> > > > > > >>>>> Rocket-MQ so maybe my understanding of it is incorrect. I
> > > > apologise
> > > > > > if
> > > > > > >>> that
> > > > > > >>>>> is the case and would be happy to stand corrected.
> > > > > > >>>>>
> > > > > > >>>>> Following are my questions:
> > > > > > >>>>> 1. What defines a redundant message?
> > > > > > >>>>>   The constructor that I see for a message is as follows:
> > > > > > >>>>>   Message(String topic, String tags, String keys, int flag,
> > > > byte[]
> > > > > > >>> body,
> > > > > > >>>>> boolean waitStoreMsgOK)
> > > > > > >>>>>   Possible candidates to me are topic, tags (can there be
> > > > multiple
> > > > > > >>> tags?
> > > > > > >>>>> I could not find an example for this. If yes how are they
> > > > > > separated?),
> > > > > > >>> keys
> > > > > > >>>>> (Similar question to above.) and of course the body. Is
> there
> > > > > > something
> > > > > > >>>>> that I have missed in this? Is there something that we do
> not
> > > > need
> > > > > to
> > > > > > >>>>> consider?
> > > > > > > >>>>> 2. Is there a timeline on the redundant messages? What I
> mean
> > > by
> > > > > > this is
> > > > > > >>>>> that is there a time limit after which a message with
> similar
> > > > > > content is
> > > > > > >>>>> allowed. From what I gather there was no such thing
> > mentioned.
> > > > This
> > > > > > >>> would
> > > > > > >>>>> mean storing all the messages. Depending on the
> requirements
> > > this
> > > > > > may or
> > > > > > >>>>> may not be the best solution. It might be desirable that no
> > > > > > duplicates
> > > > > > >>> are
> > > > > > >>>>> needed within a certain time window (sliding). This allows
> > > > ignoring
> > > > > > of
> > > > > > >>>>> duplicate messages that were generated very close to each
> > other
> > > > (or
> > > > > > in
> > > > > > >>> the
> > > > > > >>>>> window indicated). Depending on this requirement
> > implementation
> > > > may
> > > > > > >>> become
> > > > > > >>>>> a little bit more involved.
> > > > > > >>>>>
> > > > > > >>>>> For now, these are the only questions. I have ideas that
> need
> > > > > review
> > > > > > >>> about
> > > > > > >>>>> possible implementations but I will mention them once the
> > > > > > specifications
> > > > > > >>>>> are clear to me. As an end question, I would at some point
> > like
> > > > to
> > > > > > post
> > > > > > >>>>> design ideas to this problem privately to get it reviewed
> by
> > > the
> > > > > > >>>>> development community but not make it publicly available so
> > > that
> > > > it
> > > > > > >>> cannot
> > > > > > >>>>> be plagiarised. What platform/method can I use to do that?
> Or
> > > is
> > > > > > >>> submitting
> > > > > > >>>>> a draft to the Google platform the only possible way to
> > > > accomplish
> > > > > > this?
> > > > > > >>>>>
> > > > > > >>>>> Thanks a lot for reading this through and looking forward
> to
> > > your
> > > > > > >>> inputs.
> > > > > > >>>>>
> > > > > > >>>>> Regards,
> > > > > > >>>>> Sohaib Iftikhar
> > > > > > >>>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>