Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2014/02/11 16:07:12 UTC
more complex analytics
Hi
Are there any examples of how to do operations other than counting in Spark via map then reduceByKey?
It's pretty straightforward to do counts, but how do I plug in my own function (say, a conditional sum based on tuple fields, or a moving average)?
Here's my count example so we have some code to work with:
val inputList = List( ("name","1","11134"), ("name","2","11134"), ("name","1","11130"), ("name2","1","11133") )
sc.parallelize(inputList)
  .map(x => (x, 1))
  .reduceByKey(_ + _) // sum the 1s per distinct tuple, i.e. a count
  .foreach(println)
How would I add up field 2 across tuples whose first field ("name") and last field are the same?
In my example the result I want is:
"name","1+2","11134"
"name","1","11130"
"name2","1","11133"
Thanks
-A
Re: more complex analytics
Posted by Andrew Ash <an...@andrewash.com>.
I would key by those things that should be the same and then reduce by sum.
sc.parallelize(inputList)
  .map(x => (x._1, x._2.toLong, x._3.toLong)) // parse the numeric fields from String
  .map(x => ((x._1, x._3), x._2))             // key by the name and the final number field
  .reduceByKey(_ + _)
Andrew
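[Editor's note: the key-and-sum logic above can be sketched with plain Scala collections, no SparkContext needed. This is a minimal illustration of what reduceByKey(_ + _) does per key, using groupBy as the local analogue of Spark's shuffle; the object name KeySumSketch is made up for the example.]

```scala
object KeySumSketch {
  def main(args: Array[String]): Unit = {
    val inputList = List(("name","1","11134"), ("name","2","11134"),
                         ("name","1","11130"), ("name2","1","11133"))
    // Key by (name, last field), value is field 2 parsed to Long;
    // then sum the values per key -- the collections analogue of reduceByKey(_ + _).
    val summed = inputList
      .map(x => ((x._1, x._3), x._2.toLong))
      .groupBy(_._1)
      .map { case (k, vs) => (k, vs.map(_._2).sum) }
    summed.toList.sortBy(_._1).foreach(println)
  }
}
```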
On Tue, Feb 11, 2014 at 7:07 AM, Adrian Mocanu <am...@verticalscope.com> wrote:
> [quoted text of the original message snipped; see above]
Re: more complex analytics
Posted by purav aggarwal <pu...@gmail.com>.
sc.parallelize(inputList)
  .map(x => ((x._1, x._3), x._2.toLong)) // parse field 2 to Long so reduceByKey sums numbers, not Strings
  .reduceByKey(_ + _)
You need to understand what's happening when you say .map(x => (x, 1)):
"For every x (which is a tuple of 3 fields in your case), you map it to a
pair with key = x and value = 1."
In .map(x => ((x._1, x._3), x._2)) you instead set the key to your first and
third fields and the value to your second field.
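[Editor's note: the original question also asked about averages. A per-key average needs both a sum and a count, so carry a (sum, count) pair through the reduce and divide at the end; this plain-collections sketch shows the shape, which transfers directly to reduceByKey on pair RDDs. The object name KeyAvgSketch is made up for the example.]

```scala
object KeyAvgSketch {
  def main(args: Array[String]): Unit = {
    val inputList = List(("name","1","11134"), ("name","2","11134"),
                         ("name","1","11130"), ("name2","1","11133"))
    // Key by (name, last field); the value is a (sum, count) pair so the
    // reduce stays associative -- the same reduce function works in Spark.
    val avgs = inputList
      .map(x => ((x._1, x._3), (x._2.toLong, 1L)))
      .groupBy(_._1)
      .map { case (k, vs) =>
        val (sum, count) =
          vs.map(_._2).reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        (k, sum.toDouble / count)
      }
    avgs.toList.sortBy(_._1).foreach(println)
  }
}
```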
On Tue, Feb 11, 2014 at 8:37 PM, Adrian Mocanu <am...@verticalscope.com> wrote:
> [quoted text of the original message snipped; see above]