Posted to user@spark.apache.org by Hao REN <ju...@gmail.com> on 2013/11/06 01:40:13 UTC

rdd.foreach doesn't act as expected

Hi,

Just a quick question:

While playing with Spark using the toy code below, I get some unexpected results.


case class A(var a: Int) {
    def setA() = { a = 100 }
}

val as = sc.parallelize(List(A(1), A(2)))   // it is an RDD[A]


as.foreach(_.setA())

as.collect  // it gives Array[this.A] = Array(A(1), A(2))


The expected result is Array(A(100), A(100)). I am just trying to update
the contents of the A objects that reside in the RDD.

1) Does foreach do the right thing?
2) What is the best way to update objects in an RDD? Should I use 'map' instead?

Thank you.

Hao

-- 
REN Hao

Data Engineer @ ClaraVista

Paris, France

Tel:  +33 06 14 54 57 24
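To see why the foreach call appears to do nothing: Spark ships the closure to executor tasks, which operate on deserialized copies of the elements (and since this RDD is not cached, collect rebuilds the array from the original list anyway), so the driver-side objects are never touched. A minimal plain-Scala sketch, simulating the copy step by hand, with no Spark required:

```scala
case class A(var a: Int) {
  def setA(): Unit = { a = 100 }
}

val driverSide = List(A(1), A(2))

// Spark serializes the closure and data; each task mutates
// deserialized copies. Simulate that copy step explicitly:
val executorCopies = driverSide.map(x => A(x.a))
executorCopies.foreach(_.setA())

println(driverSide)      // List(A(1), A(2)) -- originals untouched
println(executorCopies)  // List(A(100), A(100)) -- only the copies changed
```

The manual `driverSide.map(x => A(x.a))` copy is just a stand-in for serialization/deserialization, but the effect on the original objects is the same: none.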

Re: rdd.foreach doesn't act as expected

Posted by Matei Zaharia <ma...@gmail.com>.
In general, you shouldn’t be mutating data in RDDs. That will make it impossible to recover from faults.

In this particular case, you got 1 and 2 because the RDD isn’t cached. You just get the same list you called parallelize() with each time you iterate through it. But caching it and modifying it in place would not be a good idea — use a map() to create a new RDD instead.

Matei
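Matei's suggestion, as a small sketch (plain Scala collections standing in for the RDD, since the map/copy pattern is identical): derive new objects with map rather than mutating in place.

```scala
case class A(a: Int)  // immutable now: no var needed

val as = List(A(1), A(2))              // stand-in for sc.parallelize(...)
val updated = as.map(_.copy(a = 100))  // new objects; originals untouched

println(updated)  // List(A(100), A(100))
```

On a real RDD the call looks the same: `as.map(_.copy(a = 100))` returns a new RDD[A], leaving the lineage recoverable after faults.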

On Nov 6, 2013, at 5:41 PM, Hao REN <ju...@gmail.com> wrote:

> 'map' works as expected. The mutable object here is just due to the use case: the data needs to be updated every day.
> I am wondering what the best way to do that is. Not sure that Spark supports in-place updates well.
> 
> 
> 2013/11/6 Mohit Jaggi <mo...@ayasdi.com>
> my guess is you need to use a map for this. foreach is for side effects, and I am not sure that changing the object itself is an expected use. Also, the objects are supposed to be immutable; yours isn't.
> 
> 
> On Tue, Nov 5, 2013 at 4:40 PM, Hao REN <ju...@gmail.com> wrote:
> Hi,
> 
> Just a quick question:
> 
> While playing with Spark using the toy code below, I get some unexpected results.
> 
> 
> case class A(var a: Int) {
>     def setA() = { a = 100 }
> }
> 
> val as = sc.parallelize(List(A(1), A(2)))   // it is an RDD[A]
> 
> as.foreach(_.setA())
> 
> as.collect  // it gives Array[this.A] = Array(A(1), A(2))
> 
> 
> The expected result is Array(A(100), A(100)). I am just trying to update the contents of the A objects that reside in the RDD.
> 
> 1) Does foreach do the right thing?
> 2) What is the best way to update objects in an RDD? Should I use 'map' instead?
> 
> Thank you.
> 
> Hao
> 
> -- 
> REN Hao
> 
> Data Engineer @ ClaraVista
> 
> Paris, France
> 
> Tel:  +33 06 14 54 57 24
> 
> 
> 
> 
> -- 
> REN Hao
> 
> Data Engineer @ ClaraVista
> 
> Paris, France
> 
> Tel:  +33 06 14 54 57 24


Re: rdd.foreach doesn't act as expected

Posted by Hao REN <ju...@gmail.com>.
'map' works as expected. The mutable object here is just due to the use
case: the data needs to be updated every day.
I am wondering what the best way to do that is. Not sure that Spark
supports in-place updates well.


2013/11/6 Mohit Jaggi <mo...@ayasdi.com>

> my guess is you need to use a map for this. foreach is for side effects,
> and I am not sure that changing the object itself is an expected use. Also,
> the objects are supposed to be immutable; yours isn't.
>
>
> On Tue, Nov 5, 2013 at 4:40 PM, Hao REN <ju...@gmail.com> wrote:
>
>> Hi,
>>
>> Just a quick question:
>>
>> While playing with Spark using the toy code below, I get some unexpected
>> results.
>>
>>
>> case class A(var a: Int) {
>>     def setA() = { a = 100 }
>> }
>>
>> val as = sc.parallelize(List(A(1), A(2)))   // it is an RDD[A]
>>
>>
>> as.foreach(_.setA())
>>
>> as.collect  // it gives Array[this.A] = Array(A(1), A(2))
>>
>>
>> The expected result is Array(A(100), A(100)). I am just trying to update
>> the contents of the A objects that reside in the RDD.
>>
>> 1) Does foreach do the right thing?
>> 2) What is the best way to update objects in an RDD? Should I use 'map' instead?
>>
>> Thank you.
>>
>> Hao
>>
>> --
>>  REN Hao
>>
>> Data Engineer @ ClaraVista
>>
>> Paris, France
>>
>> Tel:  +33 06 14 54 57 24
>>
>
>


-- 
REN Hao

Data Engineer @ ClaraVista

Paris, France

Tel:  +33 06 14 54 57 24

Re: rdd.foreach doesn't act as expected

Posted by Mohit Jaggi <mo...@ayasdi.com>.
my guess is you need to use a map for this. foreach is for side effects,
and I am not sure that changing the object itself is an expected use. Also,
the objects are supposed to be immutable; yours isn't.


On Tue, Nov 5, 2013 at 4:40 PM, Hao REN <ju...@gmail.com> wrote:

> Hi,
>
> Just a quick question:
>
> While playing with Spark using the toy code below, I get some unexpected
> results.
>
>
> case class A(var a: Int) {
>     def setA() = { a = 100 }
> }
>
> val as = sc.parallelize(List(A(1), A(2)))   // it is an RDD[A]
>
>
> as.foreach(_.setA())
>
> as.collect  // it gives Array[this.A] = Array(A(1), A(2))
>
>
> The expected result is Array(A(100), A(100)). I am just trying to update
> the contents of the A objects that reside in the RDD.
>
> 1) Does foreach do the right thing?
> 2) What is the best way to update objects in an RDD? Should I use 'map' instead?
>
> Thank you.
>
> Hao
>
> --
> REN Hao
>
> Data Engineer @ ClaraVista
>
> Paris, France
>
> Tel:  +33 06 14 54 57 24
>