Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2014/03/04 17:18:58 UTC

sstream.foreachRDD

Hi
I've noticed that if, in the driver of a Spark app, I use a foreach to add stream elements to a list, the list contains no elements at the end of processing.

Take this sample code:
  val list = new java.util.ArrayList[Any]()  // java.util.List is an interface and cannot be instantiated
  sstream.foreachRDD(rdd => rdd.foreach(tuple => list.add(tuple)))

If I put a print statement in the list's add method, I see the tuples being added, but when I print the list it is empty. I wait for the stream to finish before printing the list, so it's not a timing or race issue.

I think it might have something to do with the fact that Spark sends the list to its nodes and adds the data to the list there, but never sends the list back to the driver program. Or at least, in the driver program, the reference to the list does not point to the list that was sent over to the Spark nodes, which now holds the tuples from the stream.
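
The effect described above can be reproduced without Spark. Spark serializes the closure passed to rdd.foreach together with the objects it captures, and each task works on its own deserialized copy. The sketch below only imitates that step with plain Java serialization; the object name ClosureCopySketch and the helper roundTrip are made up for illustration and are not part of any Spark API.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ClosureCopySketch {
  // Hypothetical helper: serialize an object and read it back, which is
  // roughly what happens to anything captured by a task closure.
  def roundTrip[T <: java.io.Serializable](obj: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val driverList = new java.util.ArrayList[(String, Int)]()
    // Stand-in for the copy that a task on a worker would see.
    val executorList = roundTrip(driverList)
    executorList.add(("a", 1))
    // The copy grows; the original list on the "driver" side does not.
    println(s"executor copy: ${executorList.size}, driver list: ${driverList.size}")
    // prints "executor copy: 1, driver list: 0"
  }
}
```

This would also explain why a print statement inside add shows tuples: the add calls do run, but on the copies.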

Am I supposed to use RDD.collect and then add the data to the list, or what's the proper way to get tuples out of Spark?
  sstream.foreachRDD(rdd => rdd.collect.foreach(tuple => list.add(tuple)))

Thanks
-Adrian


Re: sstream.foreachRDD

Posted by Soumya Simanta <so...@gmail.com>.
I think you need to call collect.
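
A minimal sketch of what that could look like, assuming a reasonably recent Spark Streaming release on the classpath and local mode. The queue-backed input stream, the object name CollectInForeachRDD, the (String, Int) tuple type and the 5-second timeout are all assumptions made for illustration. The function passed to foreachRDD runs on the driver, and collect() copies each batch's elements back to the driver, so list.add here operates on the driver's own list.

```scala
import java.util.Collections
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CollectInForeachRDD {
  def run(): java.util.List[(String, Int)] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("collect-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical input: a single RDD of tuples fed through a queue stream.
    val queue = new mutable.Queue[RDD[(String, Int)]]()
    queue += ssc.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))
    val sstream = ssc.queueStream(queue)

    // Driver-side list; synchronized because batches are processed on a
    // different thread from the one that reads the list afterwards.
    val list = Collections.synchronizedList(new java.util.ArrayList[(String, Int)]())

    // collect() returns the batch as an Array on the driver.
    sstream.foreachRDD { rdd =>
      rdd.collect().foreach(tuple => list.add(tuple))
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // assumed timeout, in milliseconds
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    list
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Note that collect() pulls the whole batch into driver memory, so this is probably only appropriate when each batch is small.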
