Posted to user@spark.apache.org by Luis Ángel Vicente Sánchez <la...@gmail.com> on 2015/03/25 14:57:57 UTC

foreachRDD execution

I have a simple and probably dumb question about foreachRDD.

We are using Spark Streaming + Cassandra to compute concurrent users every
5 minutes. Our batch interval is 10 seconds and our block interval is 2.5
seconds.

At the end of the pipeline we use foreachRDD to join the data in each RDD
with existing data in Cassandra, update the counters, and then save the
result back to Cassandra.

To the best of my understanding, in this scenario, Spark Streaming produces
one RDD every 10 seconds and foreachRDD processes them sequentially; that
is, the foreachRDD function would never run for two batches in parallel.

Am I right?

Regards,

Luis

Re: foreachRDD execution

Posted by Tathagata Das <td...@databricks.com>.

Yes, that is the correct understanding. There are undocumented parameters
that allow batches to be processed in parallel, but I do not recommend
using those :)

TD
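
The behaviour confirmed above can be pictured with a small model. This is
only a sketch in plain Python, not Spark code: it assumes one job per batch,
handed in batch order to a pool with a single worker thread. The names
intervals_per, BatchLog, run_batches and the concurrent_jobs argument are
hypothetical and exist only for this illustration. The intervals are the
ones from the question; the blocks-per-batch figure assumes a receiver-based
stream, where each block becomes one partition of the batch RDD.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

BATCH_INTERVAL_S = 10.0   # batch interval from the question
BLOCK_INTERVAL_S = 2.5    # block interval from the question
WINDOW_S = 5 * 60         # "every 5 minutes"


def intervals_per(span_s, interval_s):
    """How many whole intervals of length interval_s fit in span_s."""
    return int(span_s // interval_s)


class BatchLog:
    """Records the order batches start in and the peak number running at once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._running = 0
        self.peak_running = 0
        self.started = []

    def handle(self, batch_id):
        # Stand-in for the foreachRDD body (join, update counters, save).
        with self._lock:
            self._running += 1
            self.peak_running = max(self.peak_running, self._running)
            self.started.append(batch_id)
        with self._lock:
            self._running -= 1
        return batch_id


def run_batches(num_batches, handler, concurrent_jobs=1):
    """Hypothetical scheduler model: submit one job per batch, in batch order,
    to a pool of concurrent_jobs threads (1 models the default behaviour)."""
    with ThreadPoolExecutor(max_workers=concurrent_jobs) as pool:
        futures = [pool.submit(handler, b) for b in range(num_batches)]
        return [f.result() for f in futures]


# Blocks (partitions) each receiver contributes to one batch RDD.
blocks_per_batch = intervals_per(BATCH_INTERVAL_S, BLOCK_INTERVAL_S)
# Batches that make up one 5-minute reporting period.
batches_per_window = intervals_per(WINDOW_S, BATCH_INTERVAL_S)

log = BatchLog()
run_batches(batches_per_window, log.handle)
```

With a single worker the handler never overlaps with itself and batches
start in order, so a read-update-write cycle against an external store is
not raced by the next batch. Parallelism still exists inside each batch,
across the partitions of that batch's RDD.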
