Posted to user@spark.apache.org by Adrian Mocanu <am...@verticalscope.com> on 2014/02/11 16:07:12 UTC

more complex analytics

Hi
Are there any examples of how to do operations other than counting in Spark via map then reduceByKey?
It's pretty straightforward to do counts, but how do I add in my own function (say, a conditional sum based on tuple fields, or a moving average)?

Here's my count example so we have some code to work with:

// sumTuples is my reducer; it just adds two counts: (a, b) => a + b
def sumTuples(a: Int, b: Int): Int = a + b

val inputList = List( ("name","1","11134"), ("name","2","11134"), ("name","1","11130"), ("name2","1","11133") )
sc.parallelize(inputList)
  .map(x => (x, 1))       // key by the whole tuple, value 1
  .reduceByKey(sumTuples) // count identical tuples
  .foreach(println)

How would I add up field 2 across tuples whose first field ("name") and last field are the same?
In my example the result I want is:
"name","1+2","11134"
"name","1","11130"
"name2","1","11133"

Thanks
-A

Re: more complex analytics

Posted by Andrew Ash <an...@andrewash.com>.
I would key by the fields that should stay the same and then reduce by sum.

sc.parallelize(inputList)
  .map(x => (x._1, x._2.toLong, x._3.toLong)) // parse the numeric fields from String
  .map(x => ((x._1, x._3), x._2))             // key by the name and the final number field
  .reduceByKey(_ + _)                         // sum field 2 within each key
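
On your sample input this should give (("name",11134),3), (("name",11130),1) and (("name2",11133),1). If you want the result back in the original three-field shape, a quick (untested) follow-up map would be:

  .map { case ((name, num), sum) => (name, sum, num) } // back to (name, summed field, number)
  .foreach(println)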

Andrew


Re: more complex analytics

Posted by purav aggarwal <pu...@gmail.com>.
sc.parallelize(inputList)
  .map(x => ((x._1, x._3), x._2.toLong)) // toLong so reduceByKey sums numbers rather than concatenating Strings
  .reduceByKey(_ + _)

You need to understand what's happening when you say .map(x => (x, 1)):
for every x (which is a tuple of 3 fields in your case) you map it to a
pair with key = x and value = 1. In .map(x => ((x._1, x._3), x._2.toLong))
you instead set the key to your first and third fields and the value to
your second field.
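
For the moving-average half of your question, here is a minimal sketch. My assumptions: each name's values fit in memory on one node, and windowSize = 2 is just illustrative; groupByKey gives no ordering guarantee, so sort first if order matters.

val windowSize = 2 // illustrative, not from the original question
sc.parallelize(inputList)
  .map(x => (x._1, x._2.toLong)) // key by name, keep the numeric field
  .groupByKey()                  // pulls each name's values onto one node
  .mapValues(_.toSeq.sliding(windowSize).map(w => w.sum.toDouble / w.size).toList)
  .foreach(println)

For long series this won't scale; you would sort the data and do a windowed pass instead of grouping everything per key.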

