Posted to users@kafka.apache.org by S Ahmed <sa...@gmail.com> on 2013/10/01 17:19:59 UTC

use case with high rate of duplicate messages

I have a use case where thousands of servers send status-type messages,
which I am currently handling in real time without any kind of queueing
system.

Currently, when I receive a message, I compute an md5 hash of it, perform
a lookup in my database to see if it is a duplicate, and if not, I store
the message.
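In code, the per-message check I'm doing is roughly this (a sketch; a plain dict stands in for the database, and the names are illustrative):

```python
import hashlib

seen = {}  # stands in for the database table of stored messages

def handle_message(raw: bytes) -> bool:
    """Store the message unless its md5 digest was already seen.

    Returns True if the message was stored, False if it was a duplicate.
    """
    digest = hashlib.md5(raw).hexdigest()
    if digest in seen:        # the database-lookup step
        return False
    seen[digest] = raw        # the store step
    return True
```

So every single message costs one hash plus one round-trip to the database.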

The message format can be either XML or JSON, and the actual parsing of
the message takes time, so I am thinking of storing all the messages in
Kafka first and then batch processing them, in the hope that this will be
faster.
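Batched, the same idea would look roughly like this (a sketch; duplicates within the batch are dropped in memory first, then one bulk check runs against already-stored digests — `already_stored` stands in for a database query):

```python
import hashlib

def dedup_batch(messages, already_stored):
    """Drop duplicates within the batch, then drop anything whose digest
    the database (here a plain set of hex digests) already contains."""
    unique = {}
    for raw in messages:
        digest = hashlib.md5(raw).hexdigest()
        unique.setdefault(digest, raw)   # first occurrence in the batch wins
    # one bulk membership check instead of one lookup per message
    return [raw for digest, raw in unique.items()
            if digest not in already_stored]
```

The win would come from amortizing the database round-trips, not from the hashing itself.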

Do you think recognizing duplicate messages would be faster this way, or
is it the same problem just done at a batch level?

Re: use case with high rate of duplicate messages

Posted by Neha Narkhede <ne...@gmail.com>.
It is only available in 0.8.1 (current trunk), which has not been released
yet. We plan to release it right after 0.8-final is out. Here are some
wikis that describe the deduplication feature:

https://cwiki.apache.org/confluence/display/KAFKA/Keyed+Messages+Proposal
https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction
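As a toy model of what compaction does — the log eventually retains only the latest record for each key — the retention semantics can be sketched in a few lines (this is not the broker implementation, just the behavior):

```python
def compact(log):
    """Given (key, value) records in offset order, keep only the last
    record for each key, preserving the order keys first appeared."""
    latest = {}
    for key, value in log:
        latest[key] = value   # later offsets overwrite earlier ones
    return list(latest.items())
```

With the md5 of the message as the key, repeated sends of an identical message collapse to a single retained record after cleaning.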

Thanks,
Neha


On Tue, Oct 1, 2013 at 11:01 AM, Sybrandy, Casey <
Casey.Sybrandy@six3systems.com> wrote:

> Interesting.  I didn't know that Kafka had deduplication capabilities.
>  How do you leverage it?  Also, is it supported in Kafka 0.7.x?
>
> -----Original Message-----
> From: Guozhang Wang [mailto:wangguoz@gmail.com]
> Sent: Tuesday, October 01, 2013 11:33 AM
> To: users@kafka.apache.org
> Subject: Re: use case with high rate of duplicate messages
>
> Batch processing will increase throughput but will also increase latency;
> how much latency can your real-time processing tolerate?
>
> One thing you could try is to use keyed messages, with the md5 hash of
> your message as the key. Kafka has a deduplication mechanism on the
> brokers that dedups messages with the same key. All you need to do is set
> the dedup frequency appropriately for your use case.
>
> Guozhang
>
>
> On Tue, Oct 1, 2013 at 8:19 AM, S Ahmed <sa...@gmail.com> wrote:
>
> > I have a use case where thousands of servers send status-type
> > messages, which I am currently handling in real time without any kind
> > of queueing system.
> >
> > Currently, when I receive a message, I compute an md5 hash of it,
> > perform a lookup in my database to see if it is a duplicate, and if
> > not, I store the message.
> >
> > The message format can be either XML or JSON, and the actual
> > parsing of the message takes time, so I am thinking of storing
> > all the messages in Kafka first and then batch processing them,
> > in the hope that this will be faster.
> >
> > Do you think recognizing duplicate messages would be faster this
> > way, or is it the same problem just done at a batch level?
> >
>
>
>
> --
> -- Guozhang
>

RE: use case with high rate of duplicate messages

Posted by "Sybrandy, Casey" <Ca...@Six3Systems.com>.
Interesting. I didn't know that Kafka had deduplication capabilities. How do you leverage it? Also, is it supported in Kafka 0.7.x?

-----Original Message-----
From: Guozhang Wang [mailto:wangguoz@gmail.com] 
Sent: Tuesday, October 01, 2013 11:33 AM
To: users@kafka.apache.org
Subject: Re: use case with high rate of duplicate messages

Batch processing will increase throughput but will also increase latency; how much latency can your real-time processing tolerate?

One thing you could try is to use keyed messages, with the md5 hash of your message as the key. Kafka has a deduplication mechanism on the brokers that dedups messages with the same key. All you need to do is set the dedup frequency appropriately for your use case.

Guozhang


On Tue, Oct 1, 2013 at 8:19 AM, S Ahmed <sa...@gmail.com> wrote:

> I have a use case where thousands of servers send status-type
> messages, which I am currently handling in real time without any kind
> of queueing system.
>
> Currently, when I receive a message, I compute an md5 hash of it,
> perform a lookup in my database to see if it is a
> duplicate, and if not, I store the message.
>
> The message format can be either XML or JSON, and the actual
> parsing of the message takes time, so I am thinking of storing
> all the messages in Kafka first and then batch processing them,
> in the hope that this will be faster.
>
> Do you think recognizing duplicate messages would be faster
> this way, or is it the same problem just done at a batch level?
>



--
-- Guozhang

Re: use case with high rate of duplicate messages

Posted by Guozhang Wang <wa...@gmail.com>.
Batch processing will increase throughput but will also increase latency;
how much latency can your real-time processing tolerate?

One thing you could try is to use keyed messages, with the md5 hash of
your message as the key. Kafka has a deduplication mechanism on the
brokers that dedups messages with the same key. All you need to do is set
the dedup frequency appropriately for your use case.
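Concretely, the key would just be the digest of the message bytes, so identical payloads always map to the same key (a sketch; the helper name is made up):

```python
import hashlib

def message_key(payload: bytes) -> bytes:
    """Key for a keyed Kafka message: the md5 digest of the payload, so
    identical messages share a key and key-based cleanup can eventually
    discard all but the latest copy."""
    return hashlib.md5(payload).digest()
```

You would then send each record with this key alongside the original payload as the value.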

Guozhang


On Tue, Oct 1, 2013 at 8:19 AM, S Ahmed <sa...@gmail.com> wrote:

> I have a use case where thousands of servers send status-type messages,
> which I am currently handling in real time without any kind of queueing
> system.
>
> Currently, when I receive a message, I compute an md5 hash of it,
> perform a lookup in my database to see if it is a duplicate, and if
> not, I store the message.
>
> The message format can be either XML or JSON, and the actual parsing of
> the message takes time, so I am thinking of storing all the messages
> in Kafka first and then batch processing them, in the hope that this
> will be faster.
>
> Do you think recognizing duplicate messages would be faster this way,
> or is it the same problem just done at a batch level?
>



-- 
-- Guozhang