You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@rocketmq.apache.org by KaiYuan Yang <ma...@alibaba-inc.com> on 2018/03/01 08:19:00 UTC

Re:[GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism

Hi, Sohaib

No message got lost has a higher priority in RocketMQ, so there are some “waiting for response” or “retry” logic. In consequence there is the “redundant message” problem. 

I think you’d better point out the “redundant message” parts in the current architecture and describe your solution with a flow chart first.

Then that's convenient for other people to give their opinion.
Regards,Mark------------------------------------------------------------------发件人:Sohaib Iftikhar <so...@gmail.com>发送时间:2018年3月1日(星期四) 03:43收件人:dev <de...@rocketmq.apache.org>主 题:Re: [GSOC|ROCKETMQ-124] Support non-redundant message delivery mechanism
Hi guys,
Would be nice to have some feedback on this as the deadline is not too far :)
Thanks,Sohaib
Regards,
Sohaib Iftikhar
--
Man is still the most extraordinary computer of all.--
On Mon, Feb 26, 2018 at 10:36 AM, Sohaib Iftikhar <so...@gmail.com> wrote:
Thank you for the pointers to the code. This was super helpful. The multiple keys can probably be serialized better than separating them with a space but that is already legacy I suppose.
Firstly filters like bloom or cuckoo are heuristic. They can help make things faster but definitely cannot be used as the only solution. Hence, in the end, we will still need a persistent keystore/distributed set. My plan was to have this keystore as distributed (raft guarantee etc.). The keystore can also hold a persistent filter on its end. If a broker collapses it can renew/refresh its filter from the keystore. Hence eliminating the problems about crashes that you mention. The problem here could be in maintaining performance for filters in case of removals from the keystore (for eg: sliding windows as mentioned in my previous mail). Periodic refreshal of filters can help solve this but I am open to suggestions on how to make this better.
I think implementing a distributed set on the client cluster has its caveats. The way I understand RocketMQ is that we do not have control over the diskspace/memory on the client end. So we probably only have a constant amount. A distributed set on the client would also need to be persistent. For eg: if a client restarts/recovers etc. This basically means we need a keystore on the client instead of the broker cluster. This probably puts too much responsibility on the client cluster. A different approach would be to ensure that the offsets are always in sync with the broker. Since the broker only serves unique messages (based on the proposed solution on the producer/broker end) all we need to ensure is that a client does not consume messages with the same offset twice.
Please suggest improvements if this does not look like the correct approach. Also would be great if someone can come up with a completely different approach so that we can weigh up pros and cons.
Thanks for reading this through and looking forward to your opinions.
Regards,Sohaib
Regards,
Sohaib Iftikhar
--
Man is still the most extraordinary computer of all.--
On Mon, Feb 26, 2018 at 3:58 AM, Zhanhui Li <li...@gmail.com> wrote:
Hi Sohaib,About multiple key support, the following code snippet should clarify your doubt:org.apache.rocketmq.common.message.Message class has overloaded setKeys methods, allowing your to set multiple keys via string(separated by space…sorry, we have not yet unified all separators, hoping this does not confuse you) or collection.
When broker tries to build index for the message with multiple keys, multiple index entries are inserted into the indexing file. See org.apache.rocketmq.store.index.IndexService#buildIndex
In terms of eliminating message duplication, personally, I wish we have a feature of exactly-once semantic covering the whole cluster and the complete send-store-consume processes. A rough idea is route the message according to its unique key to a broker according to a rule; The serving broker ensures uniqueness of the message according to the key( as you said, bloom-filter/cuckoo-filter, etc);  Things might looks simple, but issues resides in scenarios where cluster is experiencing membership changes: for example, what if a broker crashed down? We might need propagate bloom-filter bitset synchronously to other brokers having the same topics; What if a new broker joins in the cluster and starts to serve? I do not mean this is too complex to implement. Instead, this is a pretty interesting topic and fancy feature to have. Alternatively, we might defer eliminating duplicates to the consumption phase using kind of distributed set. For sure, my proposing idea suffers the same challenges including membership changes. Guys of dev board, any insights on this issue?Zhanhui Li

在 2018年2月26日,上午2:47,Sohaib Iftikhar <so...@gmail.com> 写道:
Hi Zhanhui,

I have a doubt about these multiple keys. If I am wrong in any of the
assumptions I make please point it out.

If there is support for multiple keys I cannot see this in the code. The
class Message only stores a single key in the property map against the
property name "KEYS". Is this also done in the same ways as tags? That is
different keys are separated with ' || '? So basically as a user of the
producer API it is the user's responsibility to ensure that he separates
the different keys with the correct separator. I can see an obvious problem
here. What if the key contains this special character ' || '? But maybe
this event is rare and hence this is not important. Could you point me to
some source/doc that explains this part? I was looking at the index section
rocketmq-store but I have not been able to understand the indexing process
completely for now. I will keep reading the source to get a better idea.

Moving on to the implementational details. Here is a broad idea of one
possible way to approach it.

The attempt is to remove duplicate messages. In this issue, I would like to
aim at eliminating duplicate messages at the producer/broker end. For now,
we do not concern ourselves with the duplicate messages happening due to
unwritten consumer offsets as these two issues have different solutions.
One way to solve this problem at the producer/broker end could be to have a
distributed key store that stores the messages. We can make it configurable
such that this distributed store stores all messages or works as a sliding
window keeping only the messages from the last X seconds specified by the
user. We can have a layer on top to check set membership such as a bloom
filter or a cuckoo filter (
https://www.cs.cmu.edu/~dga/papers/cuckoo-conext2014.pdf) to help
performance. Every message being pushed in by a producer are checked in
first with the filter and in case of a positive result with this key store.
If the message is found then it is discarded. This helps remove duplicates
completely from a producer perspective. The core of this idea is the
distributed key store which would be completely separate from the current
message storage. Since the concept of a distributed key store or a
key/value store is not novel there are two ways to this.
1. Implement it ourselves. This would be high effort but no external
dependencies.
2. Use a key-value store such as Redis (which already has timeouts and
persistence but a large memory footprint) or some other disk-based storage
for set membership. This would include an external dependency but
development time will reduce significantly for such a solution.
I am inclined towards implementing it by myself as this would avoid
dependencies on other products especially since RocketMQ is currently a
self-reliant system. In addition, my past experience with building such a
store should also come in handy.

I would like to know the opinions of the development community on this
approach and to suggest improvements on it. Looking forward to your
responses to this.

====<question unrelated to issue>=====
To increase my familiarity with the code base and to help prove that I am
familiar with the tools and technologies in place it would be great if I
could be pointed to some low effort issues that I could help out with. In
case there are no 'newbie' issues available I could help improve the
comments inside the codebase. I noticed some source files with no
explanations which can be documented via comments to help onboard a new
contributor faster.
====</question unrelated to issue>=====

Thanks a lot for reading this through and looking forward to your opinions.

Regards,
Sohaib


On Sat, Feb 24, 2018 at 11:50 AM, Zhanhui Li <li...@gmail.com> wrote:

Hi Sohaib,

Happy to know you are interested in RocketMQ.

First, let me answer questions you raised.

— can there be multiple tags?
No. At present, the storage engine allows single tag only. Subscriptions
are allowed to use combination of tags. The current model should meet your
business development. If not, please let us know.


— key (Similar question to above.)
RocketMQ builds index using message keys. A single message may have
multiple keys.

— About redundant message
From my understanding, you are trying to eliminate duplicate messages.
True there are various reasons which may cause message duplication, ranging
from message delivery and consumption. Discussion on this topic is warmly
welcome.  Had you had any idea to contribute on this issue, the developer
board is happy to discuss.

Zhanhui Li




在 2018年2月24日,上午11:17,Sohaib Iftikhar <so...@gmail.com> 写道:

My earlier email message seems to have gotten lost. So I will try again.
Please see the original message for the discussion.

Regards,
Sohaib Iftikhar

-- Man is still the most extraordinary computer of all.--

On Tue, Feb 20, 2018 at 1:54 AM, Sohaib Iftikhar <so...@gmail.com>
wrote:

Hi,

I am interested in working on this issue (https://issues.apache.org/
jira/browse/ROCKETMQ-124) as part of GSOC-18. I have a few questions for
the same. I am not sure if this discussion needs to be on the JIRA
issue or
here. Feel free to correct me if this is the wrong platform. Also while
I
have worked with distributed pub-sub systems I am still fairly new to
Rocket-MQ so maybe my understanding of it is incorrect. I apologise if
that
is the case and would be happy to stand corrected.

Following are my questions:
1. What defines a redundant message?
   The constructor that I see for a message is as follows:
   Message(String topic, String tags, String keys, int flag, byte[]
body,
boolean waitStoreMsgOK)
   Possible candidates to me are topic, tags (can there be multiple
tags?
I could not find an example for this. If yes how are they separated?),
keys
(Similar question to above.) and of course the body. Is there something
that I have missed in this? Is there something that we do not need to
consider?
2. Is their a timeline on the redundant messages? What I mean by this is
that is there a time limit after which a message with similar content is
allowed. From what I gather there was no such thing mentioned. This
would
mean storing all the messages. Depending on the requirements this may or
may not be the best solution. It might be desirable that no duplicates
are
needed within a certain time window (sliding). This allows ignoring of
duplicate messages that were generated very close to each other (or in
the
window indicated). Depending on this requirement implementation may
become
a little bit more involved.

For now, these are the only questions. I have ideas that need review
about
possible implementations but I will mention them once the specifications
are clear to me. As an end question, I would at some point like to post
design ideas to this problem privately to get it reviewed by the
development community but not make it publicly available so that it
cannot
be plagiarised. What platform/method can I use to do that? Or is
submitting
a draft to the Google platform the only possible way to accomplish this?

Thanks a lot for reading this through and looking forward to your
inputs.

Regards,
Sohaib Iftikhar