Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2011/05/06 15:49:36 UTC

MapReduce Stats calculations

MAHOUT-688 has a M/R job to calculate std. deviation for document frequencies so that it can prune noisy words.  I'm thinking of making it a bit more generic and adding a stats package to org.apache.mahout.math.hadoop that contains this and other basic stats calculations (mean, variance, sum of squares, etc.) that operate in M/R.
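
For illustration only, here is a minimal sketch of why mean, variance and std. deviation fit M/R well: each can be derived from three sums (count, sum, sum of squares) that merge by simple addition, so a combiner and a reducer can share the same logic. It is plain Python rather than the Hadoop API, it is not the MAHOUT-688 code, and the function names and sample document frequencies are hypothetical.

```python
import math
from functools import reduce

def map_value(x):
    """Map step: turn one observation into a partial summary (count, sum, sum of squares)."""
    return (1, x, x * x)

def combine(a, b):
    """Combine/reduce step: partial summaries merge by component-wise addition."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(summary):
    """Derive mean, population variance and std. deviation from the merged summary."""
    n, s, ss = summary
    mean = s / n
    variance = max(ss / n - mean * mean, 0.0)  # clamp tiny negative rounding errors
    return mean, variance, math.sqrt(variance)

# Hypothetical document frequencies, split as if across two map tasks.
split_a = [2, 4, 4, 4]
split_b = [5, 5, 7, 9]

partial_a = reduce(combine, map(map_value, split_a))
partial_b = reduce(combine, map(map_value, split_b))
mean, variance, std_dev = finalize(combine(partial_a, partial_b))
```

One caveat on the design: the sum-of-squares form can lose precision when the mean is large relative to the spread, so a real job might prefer a merge formula that tracks centered moments instead.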

Is that useful or am I re-inventing the wheel here or wasting time?  Seems like such a beast should already exist, but a quick search didn't turn up much.

-Grant

Re: MapReduce Stats calculations

Posted by Grant Ingersoll <gs...@apache.org>.
Meant to send this to dev@

On May 6, 2011, at 9:58 AM, Sean Owen wrote:

> Hadoop has something like this:
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html

Cool, and more importantly seems to provide a framework for such pieces.  I'll try that one out too.

> 
> I find there's a very strong and unfortunate tension between
> reusability and performance in some cases. Having a discrete stage to
> compute something like this is good; if it can be computed inline in a
> prior stage and output on the side, that's a big performance savings.
> 
> I also find myself tempted to construct a bunch of M/R primitives. For
> now I am trying to restrict my thinking to refactoring pieces that can
> come out easily, and that are used already in at least one place.

I think that's in line w/ what I did on M-686:  I put in variance and std. dev. b/c it needs them.  I just put them in a place that allows others to add as we need them (along the lines of what Ted is suggesting)

> 
> I suppose I mean: if you want to write primitive X and can't find one
> good use for it yet in Mahout, I'd hold off, but otherwise would
> surely add it and use it.
> 
> 
> On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> MAHOUT-688 has a M/R job to calculate std. deviation for document frequencies so that it can prune noisy words.  I'm thinking of making it a bit more generic and adding a stats package to org.apache.mahout.math.hadoop that contains this and other basic stats calculations (mean, variance, sum of squares, etc.) that operate in M/R.
>> 
>> Is that useful or am I re-inventing the wheel here or wasting time?  Seems like such a beast should already exist, but a quick search didn't turn up much.
>> 
>> -Grant



Re: MapReduce Stats calculations

Posted by Ted Dunning <te...@gmail.com>.
Yeah... un-re-used re-usable primitives are of little help, but a Mahout big
data equivalent of the R summary function would be handy to have.  The fact is,
we already have the re-usable bits anyway.  It is common to want column-wise
summaries of big matrices.  Useful summaries include:

a) moment based statistics like average and standard deviation

b) rank based statistics like min, max, and the 1st, 5th, 25th, 50th, 75th,
95th, and 99th percentiles.

c) counts of positive, negative and all entries

d) for word or text-like data, the total number of unique items with
frequency greater than 0, 1, 5 and the top 5-10 most common items.
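
To make the list above concrete, here is a rough in-memory sketch of what such a summary could return for one numeric column and one word-like column. It is plain Python on toy data, not a Mahout or Hadoop API. The names `summarize_numeric` and `summarize_words` are hypothetical, and the nearest-rank percentile definition is an assumption. At big-data scale the rank based part would likely need sorting or an approximate method rather than the in-memory sort used here.

```python
import statistics
from collections import Counter

PERCENTILES = (1, 5, 25, 50, 75, 95, 99)

def summarize_numeric(column):
    """Summaries (a) to (c) for one numeric column held in memory."""
    ordered = sorted(column)
    n = len(ordered)

    def nearest_rank(p):
        # Nearest-rank percentile: the value at rank ceil(p * n / 100), 1-based.
        rank = max(1, (p * n + 99) // 100)
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "std_dev": statistics.pstdev(ordered),
        "min": ordered[0],
        "max": ordered[-1],
        "percentiles": {p: nearest_rank(p) for p in PERCENTILES},
        "positive": sum(1 for x in ordered if x > 0),
        "negative": sum(1 for x in ordered if x < 0),
        "all": n,
    }

def summarize_words(items, thresholds=(0, 1, 5), top_k=5):
    """Summary (d): unique items above each frequency threshold, plus the most common items."""
    freq = Counter(items)
    unique_above = {t: sum(1 for c in freq.values() if c > t) for t in thresholds}
    return {"unique_above": unique_above, "top": freq.most_common(top_k)}

numeric = summarize_numeric([-2, -1, 0, 1, 2, 3, 4, 5, 6, 7])
words = summarize_words("a b a c a b d".split(), top_k=2)
```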


On Fri, May 6, 2011 at 6:58 AM, Sean Owen <sr...@gmail.com> wrote:

> Hadoop has something like this:
>
> http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html
>
> I find there's a very strong and unfortunate tension between
> reusability and performance in some cases. Having a discrete stage to
> compute something like this is good; if it can be computed inline in a
> prior stage and output on the side, that's a big performance savings.
>
> I also find myself tempted to construct a bunch of M/R primitives. For
> now I am trying to restrict my thinking to refactoring pieces that can
> come out easily, and that are used already in at least one place.
>
> I suppose I mean: if you want to write primitive X and can't find one
> good use for it yet in Mahout, I'd hold off, but otherwise would
> surely add it and use it.
>
>
> On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
> > MAHOUT-688 has a M/R job to calculate std. deviation for document frequencies so that it can prune noisy words.  I'm thinking of making it a bit more generic and adding a stats package to org.apache.mahout.math.hadoop that contains this and other basic stats calculations (mean, variance, sum of squares, etc.) that operate in M/R.
> >
> > Is that useful or am I re-inventing the wheel here or wasting time?  Seems like such a beast should already exist, but a quick search didn't turn up much.
> >
> > -Grant
>

Re: MapReduce Stats calculations

Posted by Sean Owen <sr...@gmail.com>.
Hadoop has something like this:
http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/aggregate/package-summary.html

I find there's a very strong and unfortunate tension between
reusability and performance in some cases. Having a discrete stage to
compute something like this is good; if it can be computed inline in a
prior stage and output on the side, that's a big performance savings.
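
As a rough illustration of the inline option, the sketch below has a stage do its primary work (tokenizing documents) while accumulating the sums needed for mean and variance of document length as a side output, so no second pass over the data is required. It is plain Python, not Hadoop code. The function name and sample documents are hypothetical, and how the side output would actually be emitted in a real job is left open.

```python
def tokenize_with_side_stats(docs):
    """Primary output: token lists. Side output: (count, sum, sum of squares) of document lengths."""
    tokenized = []
    n = total = total_sq = 0
    for doc in docs:
        tokens = doc.lower().split()  # the stage's primary work
        tokenized.append(tokens)
        length = len(tokens)          # statistics gathered in the same pass
        n += 1
        total += length
        total_sq += length * length
    return tokenized, (n, total, total_sq)

docs = ["the quick brown fox", "the lazy dog", "fox"]
tokenized, (n, total, total_sq) = tokenize_with_side_stats(docs)

mean_length = total / n
variance = total_sq / n - mean_length ** 2
```

The cost is the tension described above: the statistics code is now tied to this particular stage, whereas a discrete stage could be reused at the price of reading the data again.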

I also find myself tempted to construct a bunch of M/R primitives. For
now I am trying to restrict my thinking to refactoring pieces that can
come out easily, and that are used already in at least one place.

I suppose I mean: if you want to write primitive X and can't find one
good use for it yet in Mahout, I'd hold off, but otherwise would
surely add it and use it.


On Fri, May 6, 2011 at 2:49 PM, Grant Ingersoll <gs...@apache.org> wrote:
> MAHOUT-688 has a M/R job to calculate std. deviation for document frequencies so that it can prune noisy words.  I'm thinking of making it a bit more generic and adding a stats package to org.apache.mahout.math.hadoop that contains this and other basic stats calculations (mean, variance, sum of squares, etc.) that operate in M/R.
>
> Is that useful or am I re-inventing the wheel here or wasting time?  Seems like such a beast should already exist, but a quick search didn't turn up much.
>
> -Grant