Posted to user@spark.apache.org by yh18190 <yh...@gmail.com> on 2014/03/25 00:59:48 UTC

Splitting RDD and Grouping together to perform computation

Hi, I have a large data set of numbers (an RDD) and want to perform a
computation on groups of two values at a time. For example, if
1,2,3,4,5,6,7... is an RDD, can I group it into (1,2),(3,4),(5,6)...
and perform the respective computations in an efficient manner? Since we
don't have a way to index elements directly (with a for loop over
(i, i+1), etc.), is there a way to resolve this problem? Please suggest
something; I would be really thankful to you.
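
A minimal sketch of one way to form such pairs, assuming a spark-shell
session where sc is defined and a Spark version that provides
RDD.zipWithIndex (nums here is a hypothetical stand-in for the real data):

    import org.apache.spark.SparkContext._ // pair-RDD functions (groupByKey, sortByKey)

    val nums = sc.parallelize(1 to 8) // stand-in for the real data set

    // Key each element by index / 2 so that elements 0 and 1 share key 0,
    // elements 2 and 3 share key 1, and so on; then collect each key's values.
    val pairs = nums.zipWithIndex()
      .map { case (v, i) => (i / 2, v) }
      .groupByKey()
      .sortByKey(true)
      .map { case (_, vs) => vs.toList }

    pairs.collect().foreach(println) // List(1, 2), List(3, 4), List(5, 6), List(7, 8)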




RE: Splitting RDD and Grouping together to perform computation

Posted by yh18190 <yh...@gmail.com>.
Hi Adrian,

Of course you can sortByKey, but after that, when you perform mapPartitions,
there is no guarantee that the 1st partition holds its elements in the order
of the original sequence. I think we need a partitioner that partitions the
sequence while maintaining order...

Could anyone help me define a custom partitioner for this scenario, so that
the first 12 elements go to the 1st partition and the next 12 elements go to
the second partition?
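
A sketch of such a partitioner, assuming the elements have first been keyed
by their 0-based position (all names below are hypothetical):

    import org.apache.spark.Partitioner

    // Sends key k (an element's position) to partition k / blockSize, so the
    // first blockSize positions land in partition 0, the next blockSize in
    // partition 1, and so on; blockSize = 12 matches the scenario above.
    class BlockPartitioner(numBlocks: Int, blockSize: Int) extends Partitioner {
      override def numPartitions: Int = numBlocks
      override def getPartition(key: Any): Int = key match {
        case i: Long => (i / blockSize).toInt.min(numBlocks - 1)
        case i: Int  => (i / blockSize).min(numBlocks - 1)
        case _       => 0
      }
    }

    // Usage sketch: key every element with its position, then repartition.
    // val keyed   = nums.zipWithIndex().map { case (v, i) => (i, v) }
    // val blocked = keyed.partitionBy(new BlockPartitioner(2, 12))

Note that a shuffle does not guarantee ordering inside a partition, so after
partitionBy the values may still need to be sorted by key within each
partition (for example at the start of a mapPartitions).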




RE: Splitting RDD and Grouping together to perform computation

Posted by Adrian Mocanu <am...@verticalscope.com>.
Not sure how to change your code, because you'd need to generate the keys where you get the data. Sorry about that.
I can tell you where to put the code to remap and sort, though.

import org.apache.spark.rdd.OrderedRDDFunctions
import scala.collection.mutable.ListBuffer

val res2 = reduced_hccg.map(_._2)
  .map(x => (newkey, x)) // newkey is a placeholder: generate it where you get the data
  .sortByKey(true)
  .map(_._2)             // remap to remove the key that was only used for sorting

res2.foreach(println)
val result = res2.mapPartitions(p => {
  val l = p.toList
  val approx = new ListBuffer[Int]
  val detail = new ListBuffer[Double]
  for (i <- 0 until l.length - 1 by 2) {
    println(l(i), l(i + 1))
    approx += (l(i), l(i + 1)) // += with two arguments appends both values
  }
  // Note: a block returns only its last expression, so ending with
  // detail.toList.iterator here would silently discard approx.
  approx.toList.iterator
})
result.foreach(println)
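
For the newkey placeholder above, one option (a sketch, assuming a Spark
version that provides RDD.zipWithIndex) is to derive the key from each
element's original position:

    // Position-derived keys: sortByKey then reproduces the original order.
    val res2 = reduced_hccg.map(_._2)
      .zipWithIndex()                // (value, position)
      .map { case (x, i) => (i, x) } // position plays the role of newkey
      .sortByKey(true)
      .map(_._2)                     // drop the key after sorting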

-----Original Message-----
From: yh18190 [mailto:yh18190@gmail.com] 
Sent: March-28-14 5:17 PM
To: user@spark.incubator.apache.org
Subject: RE: Splitting RDD and Grouping together to perform computation

Hi Adrian,

Thanks for the suggestion. Could you please modify the part of my code where I need to do so? I apologise for the inconvenience; because I am new to Spark I couldn't apply it appropriately. I would be thankful to you.




RE: Splitting RDD and Grouping together to perform computation

Posted by yh18190 <yh...@gmail.com>.
Hi Adrian,

Thanks for the suggestion. Could you please modify the part of my code where
I need to do so? I apologise for the inconvenience; because I am new to Spark
I couldn't apply it appropriately. I would be thankful to you.




Re: Splitting RDD and Grouping together to perform computation

Posted by "Syed A. Hashmi" <sh...@cloudera.com>.
From the gist of it, it seems like you need to override the default
partitioner to control how your data is distributed among partitions. Take
a look at the different partitioners available (Default, Range, Hash); if
none of these gets you the desired result, you might want to provide your own.
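
For example, a RangePartitioner over position-keyed data keeps contiguous
key ranges together (a sketch; keyed is assumed to be an RDD[(Long, Int)]
of (position, value) pairs):

    import org.apache.spark.RangePartitioner

    // RangePartitioner samples the keys and assigns contiguous key ranges to
    // partitions, so position-keyed data stays roughly in order across partitions.
    val ranged = keyed.partitionBy(new RangePartitioner(2, keyed))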


On Fri, Mar 28, 2014 at 2:08 PM, Adrian Mocanu <am...@verticalscope.com> wrote:

> I say you need to remap so you have a key for each tuple that you can sort
> on.
> Then call rdd.sortByKey(true), like this: mystream.transform(rdd =>
> rdd.sortByKey(true))
> For this function to be available you need to import
> org.apache.spark.rdd.OrderedRDDFunctions
>
> -----Original Message-----
> From: yh18190 [mailto:yh18190@gmail.com]
> Sent: March-28-14 5:02 PM
> To: user@spark.incubator.apache.org
> Subject: RE: Splitting RDD and Grouping together to perform computation
>
>
> Hi,
> Here is my code for the given scenario. Could you please let me know where
> to sort? I mean, on what basis do we have to sort so that the elements
> maintain the order of the original sequence within each partition?
>
> import scala.collection.mutable.ListBuffer
>
> val res2 = reduced_hccg.map(_._2) // which gives an RDD of numbers
> res2.foreach(println)
> val result = res2.mapPartitions(p => {
>   val l = p.toList
>   val approx = new ListBuffer[Int]
>   val detail = new ListBuffer[Double]
>   for (i <- 0 until l.length - 1 by 2) {
>     println(l(i), l(i + 1))
>     approx += (l(i), l(i + 1)) // += with two arguments appends both values
>   }
>   // Note: a block returns only its last expression; as originally written
>   // the (empty) detail iterator was returned and approx was discarded.
>   approx.toList.iterator
> })
> result.foreach(println)
>
>
>

RE: Splitting RDD and Grouping together to perform computation

Posted by Adrian Mocanu <am...@verticalscope.com>.
I say you need to remap so you have a key for each tuple that you can sort on.
Then call rdd.sortByKey(true), like this: mystream.transform(rdd => rdd.sortByKey(true))
For this function to be available you need to import org.apache.spark.rdd.OrderedRDDFunctions
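
Put together, that looks roughly like this (a sketch; mystream is assumed to
be a DStream of (key, value) pairs as described above):

    import org.apache.spark.SparkContext._ // on older Spark, the implicit that provides sortByKey

    // transform applies the sort to every RDD in the stream.
    val sortedStream = mystream.transform(rdd => rdd.sortByKey(true))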

-----Original Message-----
From: yh18190 [mailto:yh18190@gmail.com] 
Sent: March-28-14 5:02 PM
To: user@spark.incubator.apache.org
Subject: RE: Splitting RDD and Grouping together to perform computation


Hi,
Here is my code for the given scenario. Could you please let me know where to sort? I mean, on what basis do we have to sort so that the elements maintain the order of the original sequence within each partition?

import scala.collection.mutable.ListBuffer

val res2 = reduced_hccg.map(_._2) // which gives an RDD of numbers
res2.foreach(println)
val result = res2.mapPartitions(p => {
  val l = p.toList
  val approx = new ListBuffer[Int]
  val detail = new ListBuffer[Double]
  for (i <- 0 until l.length - 1 by 2) {
    println(l(i), l(i + 1))
    approx += (l(i), l(i + 1)) // += with two arguments appends both values
  }
  // Note: a block returns only its last expression; as originally written
  // the (empty) detail iterator was returned and approx was discarded.
  approx.toList.iterator
})
result.foreach(println)




RE: Splitting RDD and Grouping together to perform computation

Posted by yh18190 <yh...@gmail.com>.
Hi,
Here is my code for the given scenario. Could you please let me know where
to sort? I mean, on what basis do we have to sort so that the elements
maintain the order of the original sequence within each partition?

import scala.collection.mutable.ListBuffer

val res2 = reduced_hccg.map(_._2) // which gives an RDD of numbers
res2.foreach(println)
val result = res2.mapPartitions(p => {
  val l = p.toList
  val approx = new ListBuffer[Int]
  val detail = new ListBuffer[Double]
  for (i <- 0 until l.length - 1 by 2) {
    println(l(i), l(i + 1))
    approx += (l(i), l(i + 1)) // += with two arguments appends both values
  }
  // Note: a block returns only its last expression; as originally written
  // the (empty) detail iterator was returned and approx was discarded.
  approx.toList.iterator
})
result.foreach(println)




RE: Splitting RDD and Grouping together to perform computation

Posted by Adrian Mocanu <am...@verticalscope.com>.
I think you should sort each RDD

-----Original Message-----
From: yh18190 [mailto:yh18190@gmail.com] 
Sent: March-28-14 4:44 PM
To: user@spark.incubator.apache.org
Subject: Re: Splitting RDD and Grouping together to perform computation

Hi,
Thanks, Nan Zhu. I tried to implement your suggestion in the following scenario. I have an RDD of, say, 24 elements. When I partitioned it into two groups of 12 elements each, the order of the elements was lost: elements were partitioned randomly. I need to preserve the order, so that the first 12 elements form the 1st partition and the next 12 elements form the 2nd partition.
Guys, please help me maintain the order of the original sequence even after partitioning... Any solution?
Before partition (RDD):
64
29186
16059
9143
6439
6155
9187
18416
25565
30420
33952
38302
43712
47092
48803
52687
56286
57471
63429
70715
75995
81878
80974
71288
48556
After partition, in group 1 with 12 elements:
64,
29186,
18416
30420
33952
38302
43712
47092
56286
81878
80974
71288
48556




Re: Splitting RDD and Grouping together to perform computation

Posted by yh18190 <yh...@gmail.com>.
Hi,
Thanks, Nan Zhu. I tried to implement your suggestion in the following
scenario. I have an RDD of, say, 24 elements. When I partitioned it into two
groups of 12 elements each, the order of the elements was lost: elements were
partitioned randomly. I need to preserve the order, so that the first 12
elements form the 1st partition and the next 12 elements form the 2nd
partition.
Guys, please help me maintain the order of the original sequence even after
partitioning... Any solution?
Before partition (RDD):
64
29186
16059
9143
6439
6155
9187
18416
25565
30420
33952
38302
43712
47092
48803
52687
56286
57471
63429
70715
75995
81878
80974
71288
48556
After partition, in group 1 with 12 elements:
64,
29186,
18416
30420
33952
38302
43712
47092
56286
81878
80974
71288
48556




Re: Splitting RDD and Grouping together to perform computation

Posted by Nan Zhu <zh...@gmail.com>.
I didn’t group the integers, but processed them in groups of two.

First, partition the input:

scala> val a = sc.parallelize(List(1, 2, 3, 4), 2)  
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12


Then process each partition, handling its elements in groups of 2:

scala> import scala.collection.mutable.ListBuffer
import scala.collection.mutable.ListBuffer

scala> a.mapPartitions(p => {val l = p.toList;
     | val ret = new ListBuffer[Int]
     | for (i <- 0 until l.length by 2) {
     | ret += l(i) + l(i + 1)
     | }
     | ret.toList.iterator
     | }
     | )
res7: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at mapPartitions at <console>:16



scala> res7.collect

res10: Array[Int] = Array(3, 7)

Best,

--  
Nan Zhu



On Monday, March 24, 2014 at 8:40 PM, Nan Zhu wrote:

> partition your input so that each partition holds an even number of elements
>
> use mapPartitions to operate on each partition's Iterator[Int]
>
> maybe there are more efficient ways…
>  
> Best,  
>  
> --  
> Nan Zhu
>  
>  
>  
> On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:
>  
> > Hi, I have a large data set of numbers (an RDD) and want to perform a computation on groups of two values at a time. For example, if 1,2,3,4,5,6,7... is an RDD, can I group it into (1,2),(3,4),(5,6)... and perform the respective computations in an efficient manner? Since we don't have a way to index elements directly (with a for loop over (i, i+1), etc.), is there a way to resolve this problem? Please suggest something; I would be really thankful to you.


Re: Splitting RDD and Grouping together to perform computation

Posted by Nan Zhu <zh...@gmail.com>.
partition your input so that each partition holds an even number of elements

use mapPartitions to operate on each partition's Iterator[Int]

maybe there are more efficient ways…

Best,  

--  
Nan Zhu



On Monday, March 24, 2014 at 7:59 PM, yh18190 wrote:

> Hi, I have a large data set of numbers (an RDD) and want to perform a computation on groups of two values at a time. For example, if 1,2,3,4,5,6,7... is an RDD, can I group it into (1,2),(3,4),(5,6)... and perform the respective computations in an efficient manner? Since we don't have a way to index elements directly (with a for loop over (i, i+1), etc.), is there a way to resolve this problem? Please suggest something; I would be really thankful to you.


Re: Splitting RDD and Grouping together to perform computation

Posted by yh18190 <yh...@gmail.com>.
We need someone who can explain, with a short code snippet on the given
example, so that we get a clear-cut idea of RDD indexing.
Guys, please help us.
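
As a short sketch of what indexing an RDD can look like (assuming a
spark-shell where sc is defined and a Spark version with RDD.zipWithIndex):

    // Attach a 0-based position to every element.
    val nums    = sc.parallelize(List(10, 20, 30, 40))
    val indexed = nums.zipWithIndex() // RDD[(Int, Long)]: (value, position)

    indexed.collect().foreach(println) // (10,0) (20,1) (30,2) (40,3)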




Re: Splitting RDD and Grouping together to perform computation

Posted by Walrus theCat <wa...@gmail.com>.
I'm also interested in this.


On Mon, Mar 24, 2014 at 4:59 PM, yh18190 <yh...@gmail.com> wrote:

> Hi, I have a large data set of numbers (an RDD) and want to perform a
> computation on groups of two values at a time. For example, if
> 1,2,3,4,5,6,7... is an RDD, can I group it into (1,2),(3,4),(5,6)...
> and perform the respective computations in an efficient manner? Since we
> don't have a way to index elements directly (with a for loop over (i, i+1),
> etc.), is there a way to resolve this problem? Please suggest something; I
> would be really thankful to you.