Posted to user@accumulo.apache.org by Russ Weeks <rw...@newbrightidea.com> on 2014/07/11 09:38:24 UTC

Calculating averages with eg. StatsCombiner

Hi,

I'd like to understand this paragraph in the Accumulo manual a little
better:

"The only restriction on an combining iterator is that the combiner
developer should not assume that all values for a given key have been seen,
since new mutations can be inserted at anytime. This precludes using the
total number of values in the aggregation such as when calculating an
average, for example."

By "using the total number of values in the aggregation", I presume that it
means inside the combiner's reduce method? Because it seems like if I'm
using the example StatsCombiner registered on all 3 scopes, after the scan
completes the count and the sum fields should be consistent (w.r.t each
other, of course new mutations could have been added since the scan
started) and if I divide the two I'll get an accurate average, right?
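
(For context, by "registered on all 3 scopes" I mean something like the
sketch below via the Java API. The class name and "radix" option follow
the 1.x simple examples; the Connector, priority, iterator name, and
table name are placeholders, not anything special.)

import java.util.EnumSet;

import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.iterators.Combiner;
import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;

public class AttachStats {
  // Attach the example StatsCombiner on scan, minc, and majc scopes.
  static void attachStatsCombiner(Connector conn, String table)
      throws AccumuloException, AccumuloSecurityException,
      TableNotFoundException {
    // Priority 10 and the name "stats" are arbitrary choices.
    IteratorSetting setting = new IteratorSetting(10, "stats",
        "org.apache.accumulo.examples.simple.combiner.StatsCombiner");
    // Combine every column; Combiner.setColumns(...) can restrict this.
    Combiner.setCombineAllColumns(setting, true);
    setting.addOption("radix", "10"); // values parsed as base-10 longs
    conn.tableOperations().attachIterator(table, setting,
        EnumSet.allOf(IteratorScope.class));
  }
}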

Thanks,
-Russ

Re: Calculating averages with eg. StatsCombiner

Posted by Russ Weeks <rw...@newbrightidea.com>.
Thanks, Billie, that clears things up.
-Russ


Re: Calculating averages with eg. StatsCombiner

Posted by Billie Rinaldi <bi...@gmail.com>.
Yes, any individual scan should be able to calculate an accurate average
based on the entries present at the time of the scan.  You just can't
pre-compute an average, but you can pre-compute the sum and count and do
the division on the fly.  For averaging, finishing up the calculation is
trivial, but it is a simple example of a reducer that loses information
when calculating its result: there is no function f(avg(v_0, ..., v_N),
v_new) that equals avg(v_0, ..., v_N, v_new) when you don't know N.  You
would not want a combiner that loses information to run during major or
minor compaction scopes.
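
To make that concrete, a stripped-down cousin of the example
StatsCombiner that keeps only the two fields an average needs might look
like the following. The class name and the "sum,count" value encoding
are illustrative (the real StatsCombiner also tracks min and max), but
reduce(Key, Iterator<Value>) is the actual Combiner hook. Clients would
insert a single observation v as "v,1".

import java.nio.charset.StandardCharsets;
import java.util.Iterator;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Combiner;

public class SumCountCombiner extends Combiner {
  @Override
  public Value reduce(Key key, Iterator<Value> iter) {
    long sum = 0;
    long count = 0;
    // Merge partial aggregates. Any subset of the values for a key can
    // be combined at any time, which is exactly what the compaction
    // scopes require of a combiner.
    while (iter.hasNext()) {
      String[] parts =
          new String(iter.next().get(), StandardCharsets.UTF_8).split(",");
      sum += Long.parseLong(parts[0]);
      count += Long.parseLong(parts[1]);
    }
    // Emit the merged partial, never the average itself: an average
    // alone cannot absorb a new value without the count behind it.
    return new Value((sum + "," + count).getBytes(StandardCharsets.UTF_8));
  }
}

After the scan, the client finishes the calculation on the fly, e.g.
double avg = sum / (double) count.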

