Posted to user@spark.apache.org by kant kodali <ka...@gmail.com> on 2016/09/24 06:49:44 UTC

ideas on de duplication for spark streaming?

Hi Guys,
I have a bunch of data coming into my Spark Streaming cluster from a message
queue (not Kafka). This message queue only guarantees at-least-once delivery,
so some of the messages that arrive in the Spark Streaming cluster may actually
be duplicates, and I am trying to figure out the best way to filter them. I was
thinking of using a hashmap as a broadcast variable, but then I saw that
broadcast variables are read-only. Also, instead of a global hashmap variable on
every worker node, I am thinking a distributed hash table would be a better
idea. Any suggestions on how best to approach this problem by leveraging
existing functionality?
Thanks,
kant
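
(A minimal sketch of the stateful approach, assuming each message carries a
unique id, which the post above does not actually say: Spark Streaming's
mapWithState can keep the set of seen ids inside Spark instead of a broadcast
variable. It only catches duplicates that arrive while the id is still held in
state, a limitation the replies below elaborate on.)

    import org.apache.spark.streaming.{Minutes, State, StateSpec}

    // `messages` is assumed to be a DStream[(String, String)] keyed by a unique
    // message id; the StreamingContext must have checkpointing enabled, since
    // mapWithState requires it.
    val dedupSpec = StateSpec.function(
      (id: String, payload: Option[String], state: State[Boolean]) => {
        if (state.exists) {
          None                   // id seen before: drop the duplicate
        } else {
          state.update(true)     // remember the id
          payload                // first occurrence: pass it through
        }
      }
    ).timeout(Minutes(30))       // forget ids that have been idle for 30 minutes

    val deduped = messages.mapWithState(dedupSpec).flatMap(_.toList)

Duplicates that show up after the timeout (or after a restart without the
checkpoint) would slip through, which is why the replies below point to an
external store.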

Re: ideas on de duplication for spark streaming?

Posted by Jörn Franke <jo...@gmail.com>.
As Cody said, Spark is not going to help you here.
There are two issues you need to look at: messages delivered twice (or even more often) and processed by two different processes, and failure of any component (including the message broker). Keep in mind that duplicated messages can even occur weeks later (something from experience: after a restart of the message broker, a message was sent again weeks later).
As said, a DHT can help, but implementing one yourself is a lot of error-prone effort.
You may want to look at (dedicated) Redis nodes. Redis has support for partitioning, is very fast (but please create only one connection per node, not one per lookup) and provides a lot of different data structures to solve your problem (e.g. atomic counters).
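
(A rough sketch of that Redis approach; the Jedis client, the host/port, the
key prefix and the one-week TTL are all assumptions, nothing from this thread.
Each partition opens a single connection and uses SETNX, so only the first
arrival of a given id passes through.)

    import redis.clients.jedis.Jedis

    // `stream` is assumed to be a DStream[(String, String)] of (messageId, payload).
    val deduped = stream.transform { rdd =>
      rdd.mapPartitions { records =>
        val jedis = new Jedis("redis-host", 6379)       // hypothetical Redis node
        // Materialize the partition while the connection is still open.
        val firstArrivals = records.filter { case (id, _) =>
          val key = s"dedup:$id"
          val isFirst = jedis.setnx(key, "1") == 1L     // atomic: 1 only for the first writer
          if (isFirst) jedis.expire(key, 7 * 24 * 3600) // keep the marker for a week
          isFirst
        }.toList
        jedis.close()
        firstArrivals.iterator
      }
    }

In a real job one connection (or a JedisPool) per executor would be preferable
to one per partition, but the point above stands: never one connection per
lookup.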

Re: ideas on de duplication for spark streaming?

Posted by Cody Koeninger <co...@koeninger.org>.
Spark alone isn't going to solve this problem, because you have no reliable
way of making sure a given worker has a consistent shard of the messages
seen so far, especially if there's an arbitrary amount of delay between
duplicate messages.  You need a DHT or something equivalent.
