Posted to user@spark.apache.org by wxhsdp <wx...@gmail.com> on 2014/04/08 14:09:57 UTC

Only TraversableOnce?

In my application, data parts inside an RDD partition have relations, so I
need to do some operations between them.

For example: RDD T1 has several partitions, and each partition has three
parts A, B, and C. Then I transform T1 to T2. After the transform, T2 also
has three parts D, E, and F, where D = A+B, E = A+C, and F = B+C. As far as
I know, Spark only supports operations that traverse the RDD and call a
function for each element. How can I do such a transform?

In Hadoop, I copy the data in each partition to a user-defined buffer, do
any operations I like in the buffer, and finally call output.collect() to
emit the data. But how can I construct a new RDD with distributed
partitions in Spark? makeRDD only distributes a local Scala collection to
form an RDD.
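
As the replies below work out, mapPartitions can do this. A minimal sketch
of the transform above, runnable in spark-shell (sc is the predefined
SparkContext; it assumes each partition holds exactly the three numeric
parts A, B, C, in order):

// Two partitions, each holding the three parts A, B, C
val t1 = sc.parallelize(Seq(1, 2, 3, 10, 20, 30), 2)

// Buffer each partition locally, combine the parts pairwise, and
// hand the results back as an Iterator, as mapPartitions requires
val t2 = t1.mapPartitions { iter =>
  val Array(a, b, c) = iter.toArray
  Iterator(a + b, a + c, b + c)   // D, E, F
}

// glom() exposes each partition as an array, for inspection:
// prints "3, 4, 5" and "30, 40, 50"
t2.glom().collect().foreach(p => println(p.mkString(", ")))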





Re: Only TraversableOnce?

Posted by wxhsdp <wx...@gmail.com>.
Thank you for your help! Let me have a try.


Nan Zhu wrote
> If that’s the case, I think mapPartitions is what you need, but it seems
> that you have to load the partition into memory as a whole with toArray
> 
> rdd.mapPartitions{D => {val p = D.toArray; ...}}
> 
> --  
> Nan Zhu
> 
> 
> 
> On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:
> 
>> Yes, how can I do this conveniently? I can use filter, but there will be
>> so many RDDs and it's not concise.
>>  
>>  
>>  






Re: Only TraversableOnce?

Posted by Nan Zhu <zh...@gmail.com>.
Yeah, that should be right.

-- 
Nan Zhu


On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote:

> Thank you, it works. After my operations over p, I return p.toIterator,
> because mapPartitions has an Iterator return type. Is that right?
> rdd.mapPartitions{D => {val p = D.toArray; ...; p.toIterator}}
> 
> 
> 



Re: Only TraversableOnce?

Posted by wxhsdp <wx...@gmail.com>.
Thank you, it works. After my operations over p, I return p.toIterator,
because mapPartitions has an Iterator return type. Is that right?
rdd.mapPartitions{D => {val p = D.toArray; ...; p.toIterator}}
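
For reference, mapPartitions expects a function Iterator[T] => Iterator[U],
so buffering with toArray is fine as long as an Iterator comes back out,
and p.toIterator does exactly that. A minimal spark-shell sketch (sc is
predefined; the doubling is just a stand-in for real per-partition work):

val rdd = sc.parallelize(1 to 6, 2)

val out = rdd.mapPartitions { iter =>
  val p = iter.toArray       // materialize the whole partition locally
  p.map(_ * 2).toIterator    // do the work, then return an Iterator
}

println(out.collect().mkString(", "))   // 2, 4, 6, 8, 10, 12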




Re: Only TraversableOnce?

Posted by Nan Zhu <zh...@gmail.com>.
If that’s the case, I think mapPartitions is what you need, but it seems that you have to load the partition into memory as a whole with toArray.

rdd.mapPartitions{D => {val p = D.toArray; ...}}
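
One caveat: toArray buffers the entire partition on a single executor, so
this pattern assumes each partition fits in memory. The general shape of
it, as a sketch:

val result = rdd.mapPartitions { iter =>
  val p = iter.toArray   // the whole partition as a local Array
  // ... arbitrary operations across the elements of p ...
  p.toIterator           // mapPartitions must return an Iterator
}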

--  
Nan Zhu



On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:

> Yes, how can I do this conveniently? I can use filter, but there will be so
> many RDDs and it's not concise.
>  
>  
>  



Re: Only TraversableOnce?

Posted by wxhsdp <wx...@gmail.com>.
Yes, how can I do this conveniently? I can use filter, but there will be so
many RDDs and it's not concise.




Re: Only TraversableOnce?

Posted by Nan Zhu <zh...@gmail.com>.
So the data structure looks like this:

D consists of D1, D2, D3 (each DX is a partition),

and

DX consists of d1, d2, d3 (each dx is a part, in your context)?

And what you want to do is transform

DX into (d1 + d2, d1 + d3, d2 + d3)?



Best, 

-- 
Nan Zhu



On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:

> In my application, data parts inside an RDD partition have relations, so I
> need to do some operations between them.
> 
> For example: RDD T1 has several partitions, and each partition has three
> parts A, B, and C. Then I transform T1 to T2. After the transform, T2 also
> has three parts D, E, and F, where D = A+B, E = A+C, and F = B+C. As far as
> I know, Spark only supports operations that traverse the RDD and call a
> function for each element. How can I do such a transform?
> 
> In Hadoop, I copy the data in each partition to a user-defined buffer, do
> any operations I like in the buffer, and finally call output.collect() to
> emit the data. But how can I construct a new RDD with distributed
> partitions in Spark? makeRDD only distributes a local Scala collection to
> form an RDD.
> 
> 
> 
> 