Posted to user@spark.apache.org by Domingo Mihovilovic <do...@exabeam.com> on 2013/10/01 00:52:51 UTC

Spark Streaming architecture question - shared memory model

I have a quick architecture question. Imagine that you are processing stream data at high speed and need to build, update, and access some in-memory data structure where the "model" is stored.

One option is to store this model in a DB, but that would require a huge number of updates (assume we do not want to go to a large Cassandra ring or similar). What's the preferred way to manage this model in memory so that there is consistent, shared access across multiple nodes in the cluster? Is there some sort of shared-memory approach, or do I need to try to manage this model as an RDD?

All suggestions are welcome.

dma

Re: Spark Streaming architecture question - shared memory model

Posted by dmihovilovic <do...@exabeam.com>.
Any idea why the RDD is maintained so secretively "behind the scenes"? It looks like the only way to get the state is after updating it. There is no exposed method to simply read the state, so we are trying to get at it by applying an update function that does nothing. We are doing some acrobatics to get the state, and this model seems very odd.

The only example I have found so far is a simple updating of counts. Is anyone aware of more complex examples with state updates and retrievals?
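For what it's worth, the "do-nothing update" workaround we resort to can be sketched outside Spark in plain Java. This is only an illustration of the update-function shape that updateStateByKey expects, not Spark API; the name identityRead is hypothetical:

```java
import java.util.*;

// Sketch of the "do-nothing update" read workaround: the update function
// receives (new values for the key in this batch, old state) and returns
// the new state. With no new values it echoes the state unchanged, which
// is the only way to observe the state per batch.
public class StateReadSketch {
    static Optional<Integer> identityRead(List<Integer> newValues, Optional<Integer> state) {
        if (newValues.isEmpty()) return state;  // read-only pass-through
        int sum = newValues.stream().mapToInt(Integer::intValue).sum();
        return Optional.of(state.orElse(0) + sum);
    }

    public static void main(String[] args) {
        Optional<Integer> state = Optional.of(3);
        // A batch with no new data: the identity update just echoes the state
        Optional<Integer> read = identityRead(Collections.emptyList(), state);
        System.out.println(read.get()); // 3
    }
}
```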

dma

On Sep 30, 2013, at 3:58 PM, Michael Malak wrote:

> Domingo Mihovilovic <do...@exabeam.com> writes:
> 
>>  Imagine that you are processing stream data at high speed and need to build, update,
>> and access some in-memory data structure where the "model" is stored.
> 
> Normally this is done with updateStateByKey, which maintains an RDD behind the scenes.
> 
> Michael Malak
> http://www.linkedin.com/in/michaelmalak


Re: Spark Streaming architecture question - shared memory model

Posted by Michael Malak <mi...@yahoo.com>.
Domingo Mihovilovic <do...@exabeam.com> writes:

> Imagine that you are processing stream data at high speed and need to build, update,
> and access some in-memory data structure where the "model" is stored.

Normally this is done with updateStateByKey, which maintains an RDD behind the scenes.
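For illustration, the shape of that update function, and the per-batch fold over keyed state that updateStateByKey implies, can be sketched in plain Java without a Spark cluster. Names like RunningCountSketch and step are hypothetical helpers, not Spark API:

```java
import java.util.*;

// Plain-Java sketch of the running-count pattern behind updateStateByKey.
// Per batch interval, Spark calls the update function with all new values
// for a key plus the previous state; step() mimics one such micro-batch.
public class RunningCountSketch {
    // Same shape as the update function: (new values, old state) -> new state
    static Optional<Integer> updateCount(List<Integer> newValues, Optional<Integer> state) {
        int sum = newValues.stream().mapToInt(Integer::intValue).sum();
        return Optional.of(state.orElse(0) + sum);
    }

    // Group the batch's (key, value) pairs and fold them into the state map
    static Map<String, Integer> step(Map<String, Integer> state,
                                     List<Map.Entry<String, Integer>> batch) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> e : batch) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        Set<String> keys = new HashSet<>(state.keySet());
        keys.addAll(grouped.keySet());
        Map<String, Integer> next = new HashMap<>();
        for (String k : keys) {
            next.put(k, updateCount(grouped.getOrDefault(k, Collections.emptyList()),
                                    Optional.ofNullable(state.get(k))).get());
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        state = step(state, List.of(Map.entry("a", 1), Map.entry("b", 2), Map.entry("a", 3)));
        state = step(state, List.of(Map.entry("b", 5)));
        System.out.println(state.get("a")); // 4
        System.out.println(state.get("b")); // 7
    }
}
```

Keys with no new values still pass through updateCount, which is why state for quiet keys survives from batch to batch.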

Michael Malak
http://www.linkedin.com/in/michaelmalak