Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2015/07/26 18:37:31 UTC

spark as a lookup engine for dedup

Hi

I have a requirement to process a large volume of events while ignoring
duplicates at the same time.

Events are consumed from Kafka, and each event has an eventid. It may happen
that an event has already been processed and then arrives again at some other
offset.

1. Can I use a Spark RDD to persist processed events and then look up new
events against it while processing? (How do I do a lookup inside an RDD? I
have a JavaPairRDD<eventid,timestamp>.) If an event is present in the
persisted RDD, ignore it; otherwise process the event. Will rdd.lookup(key)
be efficient on a billion events?

2. How do I update the RDD with new events (since RDDs are immutable)?
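
For illustration, here is roughly what I mean by question 1, as a minimal
runnable sketch (the toy data and local master are placeholders):

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class LookupDedupSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "lookup-dedup-sketch");

        // Persisted eventid -> timestamp pairs (toy data standing in for ~1 billion entries).
        JavaPairRDD<String, Long> processed = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("e1", 1437929851000L),
                new Tuple2<>("e2", 1437929852000L))).cache();

        // Each lookup() launches a separate Spark job that scans partitions,
        // which is why I doubt one call per incoming event can be efficient.
        for (String incomingId : Arrays.asList("e1", "e3")) {
            boolean seenBefore = !processed.lookup(incomingId).isEmpty();
            System.out.println(incomingId + (seenBefore ? " is a duplicate" : " is new"));
        }

        sc.stop();
    }
}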

Thanks

Re: spark as a lookup engine for dedup

Posted by Romi Kuntsman <ro...@totango.com>.
An RDD is immutable; it cannot be changed, only transformed into a new one.
It sounds inefficient to rebuild one every ~15 seconds covering the last 24
hours; at that interval that is more than 5,000 rebuilds a day over roughly
a billion keys. I think a key-value store will be a much better fit for this
purpose.

Re: spark as a lookup engine for dedup

Posted by Shushant Arora <sh...@gmail.com>.
It's for one day of events, on the order of 1 billion, and processing happens
in a streaming application with a ~10-15 second batch interval, so lookups
need to be fast. The RDD needs to be updated with new events at each batch,
and events older than 24 hours (current time minus 24 hours) should be
removed at each processing step.

So is a Spark RDD not a fit for this requirement?
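
Roughly the per-batch "update" I was thinking of, as a sketch (seen is the
persisted state and newBatch the incoming eventid -> timestamp pairs; both
names are placeholders):

import org.apache.spark.api.java.JavaPairRDD;

public class SlidingDedupState {

    /** Build the next state from the previous 24h of eventid -> timestamp pairs plus a new batch. */
    public static JavaPairRDD<String, Long> update(JavaPairRDD<String, Long> seen,
                                                   JavaPairRDD<String, Long> newBatch,
                                                   long nowMillis) {
        final long cutoff = nowMillis - 24L * 60 * 60 * 1000;
        JavaPairRDD<String, Long> updated = seen
                .union(newBatch)
                .filter(pair -> pair._2() >= cutoff)  // drop events older than 24 hours
                .reduceByKey(Math::max)               // one entry per eventid, latest timestamp
                .cache();
        updated.count();    // materialize the new state before releasing the old one
        seen.unpersist();
        return updated;
    }
}

Since the old RDD can't be mutated, this builds a brand-new one every
interval, which is exactly what I'm worried may be too heavy.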

Re: spark as a lookup engine for dedup

Posted by Romi Kuntsman <ro...@totango.com>.
What is the throughput of processing, and for how long do you need to
remember duplicates?

You can take all the events, put them in an RDD, group by the key, and then
process each key only once.
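
A minimal sketch of that batch approach (toy data, assuming events arrive as
(eventid, payload) pairs):

import java.util.Arrays;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class BatchDedupSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "batch-dedup-sketch");

        // Toy (eventid, payload) pairs; duplicates share an eventid.
        JavaPairRDD<String, String> events = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("e1", "first"),
                new Tuple2<>("e1", "first-again"),   // duplicate
                new Tuple2<>("e2", "second")));

        // Group by eventid and keep an arbitrary single record per key,
        // so each key gets processed exactly once within this batch.
        JavaPairRDD<String, String> deduped = events.reduceByKey((a, b) -> a);

        deduped.foreach(pair -> System.out.println("processing " + pair._1()));
        sc.stop();
    }
}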
But if you have a long-running application where, for every value, you want
to check whether you have seen it before, you probably need a key-value
store, not an RDD.
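
For the long-running case, a rough sketch against Redis through the Jedis
client (just one possible store; the host, key prefix, and 24-hour TTL here
are assumptions about your setup, not known specifics):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import redis.clients.jedis.Jedis;

public class KvDedupSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "kv-dedup-sketch");
        JavaRDD<String> eventIds = sc.parallelize(Arrays.asList("e1", "e2", "e1"));

        // One connection per partition; per-event check-and-set against Redis.
        eventIds.foreachPartition(ids -> {
            Jedis jedis = new Jedis("localhost", 6379);  // assumed local Redis
            while (ids.hasNext()) {
                String id = ids.next();
                // SETNX is atomic: it returns 1 only for the first writer of a key.
                if (jedis.setnx("dedup:" + id, "1") == 1L) {
                    jedis.expire("dedup:" + id, 24 * 3600);  // forget after 24 hours
                    System.out.println("processing " + id);  // placeholder for real work
                }
            }
            jedis.close();
        });

        sc.stop();
    }
}

Because the check-and-set is atomic in the store, duplicates are dropped even
when they arrive in different batches or land on different partitions.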
