Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/04/01 16:55:10 UTC

Use combineByKey and StatCount

Hi all;

Can someone give me some tips on computing the mean of an RDD by key, maybe
with combineByKey and StatCount?

Cheers,

Jaonary

Re: Use combineByKey and StatCount

Posted by Cheng Lian <li...@gmail.com>.
I'm not sure what you mean by “mean of RDD by key”; is this what you
want?

val meansByKey = rdd
  .map { case (k, v) =>
    k -> (v, 1)
  }
  .reduceByKey { (lhs, rhs) =>
    (lhs._1 + rhs._1, lhs._2 + rhs._2)
  }
  .map { case (k, (sum, count)) =>
    k -> sum / count
  }
  .collectAsMap()

With this, you need to be careful about overflow though.
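Since the question mentions combineByKey specifically, here is a sketch of the same per-key mean expressed as the three combiner functions combineByKey expects (createCombiner, mergeValue, mergeCombiners). The Spark call itself appears only in a comment; the functions are exercised on plain Scala collections so the combine logic can be checked without a cluster.

```scala
// Per-key mean with combineByKey-style combiner functions.
// The combiner is a (sum, count) pair. On a Spark pair RDD these
// would be wired up as:
//   rdd.combineByKey(createCombiner, mergeValue, mergeCombiners)
//      .mapValues { case (sum, count) => sum / count }

def createCombiner(v: Double): (Double, Long) = (v, 1L)

def mergeValue(acc: (Double, Long), v: Double): (Double, Long) =
  (acc._1 + v, acc._2 + 1L)

def mergeCombiners(a: (Double, Long), b: (Double, Long)): (Double, Long) =
  (a._1 + b._1, a._2 + b._2)

// Local check of the combine logic on plain Scala collections:
val data = Seq("a" -> 1.0, "a" -> 3.0, "b" -> 10.0)
val combined: Map[String, Double] = data
  .groupBy(_._1)
  .map { case (k, kvs) =>
    val vs = kvs.map(_._2)
    val acc = vs.tail.foldLeft(createCombiner(vs.head))(mergeValue)
    k -> acc._1 / acc._2
  }
```

If StatCount in the original question refers to Spark's StatCounter (org.apache.spark.util.StatCounter), that class could presumably replace the hand-rolled (sum, count) pair as the combiner, since it accumulates count and mean as values are merged in.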



Re: Use combineByKey and StatCount

Posted by dachuan <hd...@gmail.com>.
It seems you can imitate RDD.top()'s implementation: for each partition,
compute the number of records and the sum of the values; then, in the final
result handler, add all the partition sums together and all the record counts
together. Dividing the total sum by the total count gives you the mean, I
mean, the arithmetic mean.
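The per-partition approach above can be sketched with a (sum, count) accumulator; on an RDD this is roughly what rdd.aggregate((0.0, 0L))(seqOp, combOp) would compute. The code below simulates two partitions with plain Scala collections, so the merge logic can be verified locally.

```scala
// seqOp folds one value into a partition's (sum, count) accumulator;
// combOp merges the accumulators produced by different partitions.
def seqOp(acc: (Double, Long), v: Double): (Double, Long) =
  (acc._1 + v, acc._2 + 1L)

def combOp(a: (Double, Long), b: (Double, Long)): (Double, Long) =
  (a._1 + b._1, a._2 + b._2)

// Simulate two partitions locally; on Spark this would be
//   rdd.aggregate((0.0, 0L))(seqOp, combOp)
val partitions = Seq(Seq(1.0, 2.0, 3.0), Seq(4.0, 5.0))
val perPartition = partitions.map(_.foldLeft((0.0, 0L))(seqOp))
val (sum, count) = perPartition.reduce(combOp)
val mean = sum / count
```

Note this computes a single global mean, not a per-key mean; for the per-key version the same (sum, count) trick goes through combineByKey or reduceByKey as in the other reply.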





-- 
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210