You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@spark.apache.org by suman bharadwaj <su...@gmail.com> on 2014/01/10 20:01:11 UTC

Help needed. Not sure how to reduceByKey works in spark

Hi,

I'm new to spark. And i needed some help in understanding how reduceByKey
works.

I have the following data:

col1                                col2   col3
1/11/2014 12:18:40 AM    123     143
1/11/2014 12:18:45 AM    123     143
1/11/2014 12:18:49 AM    123     143

the output i need is

col2  col3    totaltime(currect value of col1 - prev val of col1)
123   143        9

I'm doing the following:

map((col2,col3),col1).reduceByKey( *<here i don't know how to perform the
subtraction of dates > *)

How to perform subtraction of dates ?
How does reduceByKey work when my map emits as follows
((col2,col3),(col1,col4))?


Thanks in advance.

Re: Help needed. Not sure how to reduceByKey works in spark

Posted by Andrew Ash <an...@andrewash.com>.
So for each (col2, col3) pair, you want the difference between the earliest
col1 value and the latest col1 value?

I'd suggest something like this:

val data = sc.textFile(...).map(l => l.split("\t"))
data.map(r => ((r(1), r(2)), r(0)) // produce an RDD of ((col2, col3), col1)
    .groupByKey() // now have ((col2, col3) [col1s])
    .map(p => (p._1, (max(p._2) - min(p._2)))) // now have ((col2, col3),
diffInCol1s)

The downside of this approach is that if you have a (col2, col3) pair with
tons of col1 values, you might OOM one of your executors in the groupByKey.

Andrew


On Fri, Jan 10, 2014 at 11:01 AM, suman bharadwaj <su...@gmail.com>wrote:

> Hi,
>
> I'm new to spark. And i needed some help in understanding how reduceByKey
> works.
>
> I have the following data:
>
> col1                                col2   col3
> 1/11/2014 12:18:40 AM    123     143
> 1/11/2014 12:18:45 AM    123     143
> 1/11/2014 12:18:49 AM    123     143
>
> the output i need is
>
> col2  col3    totaltime(currect value of col1 - prev val of col1)
> 123   143        9
>
> I'm doing the following:
>
> map((col2,col3),col1).reduceByKey( *<here i don't know how to perform the
> subtraction of dates > *)
>
> How to perform subtraction of dates ?
> How does reduceByKey work when my map emits as follows
> ((col2,col3),(col1,col4))?
>
>
> Thanks in advance.
>