Posted to user@spark.apache.org by "Segerlind, Nathan L" <na...@intel.com> on 2014/11/17 04:06:39 UTC

RDD.aggregate versus accumulables...

Hi All.

I am trying to get my head around why accumulators and accumulables seem to be the most recommended way to accumulate running sums, averages, variances and the like, when the aggregate method seems to me to be the right one. I have no performance measurements yet, but aggregate seems simpler and more intuitive (and it does what one might expect an accumulator to do), whereas accumulators and accumulables seem to bring extra complications and overhead.

So...

What's the real difference between an accumulator/accumulable and aggregating an RDD? When is one method of aggregation preferred over the other?

Thanks,
Nate
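
For readers comparing the two approaches, here is a minimal plain-Java sketch of the contract behind RDD.aggregate(zeroValue, seqOp, combOp). It does not use Spark: each inner list stands in for one partition, and the Stats and AggregateSketch classes are hypothetical names invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical value type holding the running sums needed for mean and variance.
final class Stats {
    final long count;
    final double sum;
    final double sumSq;

    Stats(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    // Role of zeroValue: the identity element for add/merge.
    static Stats zero() {
        return new Stats(0, 0.0, 0.0);
    }

    // Role of seqOp: fold one element into a partition-local result.
    Stats add(double x) {
        return new Stats(count + 1, sum + x, sumSq + x * x);
    }

    // Role of combOp: merge two partition-local results.
    Stats merge(Stats other) {
        return new Stats(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    double mean() {
        return sum / count;
    }

    // Population variance computed from the running sums.
    double variance() {
        double m = mean();
        return sumSq / count - m * m;
    }
}

final class AggregateSketch {
    // Imitates aggregate: fold each "partition" from zero, then merge the partial results.
    static Stats aggregate(List<List<Double>> partitions) {
        Stats total = Stats.zero();
        for (List<Double> partition : partitions) {
            Stats local = Stats.zero();
            for (double x : partition) {
                local = local.add(x);
            }
            total = total.merge(local);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0),
                Arrays.asList(3.0, 4.0, 5.0));
        Stats s = aggregate(partitions);
        System.out.println("mean=" + s.mean() + " variance=" + s.variance());
    }
}
```

The result is an ordinary return value built only from each partition's final partial result, so nothing depends on side effects. The sum-of-squares form is chosen here for brevity; a Welford-style merge would likely be numerically safer for large values.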

Re: RDD.aggregate versus accumulables...

Posted by Surendranauth Hiraman <su...@velos.io>.

We use Algebird for calculating things like min/max, stddev, variance, etc.

https://github.com/twitter/algebird/wiki

-Suren


On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann <da...@velos.io>
wrote:

> You should *never* use accumulators for this purpose because you may get
> incorrect answers. Accumulators can count the same thing multiple times -
> you cannot rely upon the correctness of the values they compute. See
> SPARK-732 <https://issues.apache.org/jira/browse/SPARK-732> for more info.




Flashback: RDD.aggregate versus accumulables...

Posted by jiml <ji...@megalearningllc.com>.

And Lord Joe, you were right: future versions did protect accumulators in
actions. I wonder if anyone has a "modern" take on the accumulator vs.
aggregate question. It seems that if I need to aggregate by key or control
partitioning, I would use aggregate.

Bottom line question / reason for post: I wonder if anyone has more ideas
about using aggregate instead? Am I right to think accumulables are always
present on the driver, whereas an aggregate needs to be pulled to the driver
manually?

Details:

Both give me the option to write custom adds and merges. For example, here is
a class I am stubbing out:

    import org.apache.spark.AccumulableParam;

    class DropEvalAccumulableParam implements AccumulableParam<DropEvaluation, DropResult> {

        // Add additional data to the accumulator value. Is allowed to modify
        // and return r for efficiency (to avoid allocating objects).
        // r is the first value
        @Override
        public DropEvaluation addAccumulator(DropEvaluation dropEvaluation, DropResult dropResult) {
            return null; // stub
        }

        // Merge two accumulated values together. Is allowed to modify and
        // return the first value for efficiency (to avoid allocating objects).
        @Override
        public DropEvaluation addInPlace(DropEvaluation masterDropEval, DropEvaluation r1) {
            return null; // stub
        }

        // Return the "zero" (identity) value for an accumulator type, given
        // its initial value. For example, if R was a vector of N dimensions,
        // this would return a vector of N zeroes.
        @Override
        public DropEvaluation zero(DropEvaluation dropEvaluation) {
            // technically the "additive identity" of a DropEvaluation would be
            return dropEvaluation;
        }
    }
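
The three methods stubbed above line up with the three arguments of an aggregate-style call: zero with zeroValue, addAccumulator with seqOp, and addInPlace with combOp. Below is a plain-Java sketch of the by-key variant, without Spark. The fields of DropResult and DropEvaluation are not shown in the thread, so the ones used here (key, dropped, total) are purely hypothetical, as is the ByKeySketch class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in: the real DropResult is not shown in the thread.
final class DropResult {
    final String key;
    final boolean dropped;

    DropResult(String key, boolean dropped) {
        this.key = key;
        this.dropped = dropped;
    }
}

// Hypothetical stand-in: the real DropEvaluation is not shown in the thread.
final class DropEvaluation {
    final long total;
    final long dropped;

    DropEvaluation(long total, long dropped) {
        this.total = total;
        this.dropped = dropped;
    }

    // Role of zero(): an empty evaluation, the identity for add/merge.
    static DropEvaluation zero() {
        return new DropEvaluation(0, 0);
    }

    // Role of addAccumulator / seqOp: fold one result into an evaluation.
    DropEvaluation add(DropResult r) {
        return new DropEvaluation(total + 1, dropped + (r.dropped ? 1 : 0));
    }

    // Role of addInPlace / combOp: merge two evaluations.
    DropEvaluation merge(DropEvaluation other) {
        return new DropEvaluation(total + other.total, dropped + other.dropped);
    }
}

final class ByKeySketch {
    // Imitates an aggregate-by-key: build one map per "partition", then merge maps by key.
    static Map<String, DropEvaluation> aggregateByKey(List<List<DropResult>> partitions) {
        Map<String, DropEvaluation> merged = new HashMap<>();
        for (List<DropResult> partition : partitions) {
            Map<String, DropEvaluation> local = new HashMap<>();
            for (DropResult r : partition) {
                local.put(r.key, local.getOrDefault(r.key, DropEvaluation.zero()).add(r));
            }
            for (Map.Entry<String, DropEvaluation> e : local.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), DropEvaluation::merge);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<DropResult>> partitions = Arrays.asList(
                Arrays.asList(new DropResult("a", true), new DropResult("b", false)),
                Arrays.asList(new DropResult("a", false), new DropResult("a", true)));
        Map<String, DropEvaluation> byKey = aggregateByKey(partitions);
        DropEvaluation a = byKey.get("a");
        System.out.println("a: total=" + a.total + " dropped=" + a.dropped);
    }
}
```

On the driver question: in Spark itself, aggregate is an action whose result is returned to the driver, whereas aggregateByKey produces another distributed dataset that has to be collected explicitly. An accumulator's value is likewise only readable on the driver.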





--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p26456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

RE: RDD.aggregate versus accumulables...

Posted by lordjoe <lo...@gmail.com>.

I have been playing with using accumulators (despite the possible error with
multiple attempts). They provide a convenient way to get some numbers while
still performing business logic.
I posted some sample code at http://lordjoesoftware.blogspot.com/.
Even if accumulators are not perfect today, future versions may improve
them, and they are a great way to monitor execution and get a sense of
performance on lazily executed systems.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-aggregate-versus-accumulables-tp19044p19102.html

RE: RDD.aggregate versus accumulables...

Posted by "Segerlind, Nathan L" <na...@intel.com>.

Thanks for the link to the bug.

Unfortunately, using accumulators like this is getting spread around as a recommended practice despite the bug.


Re: RDD.aggregate versus accumulables...

Posted by Daniel Siegmann <da...@velos.io>.

You should *never* use accumulators for this purpose because you may get
incorrect answers. Accumulators can count the same thing multiple times -
you cannot rely upon the correctness of the values they compute. See
SPARK-732 <https://issues.apache.org/jira/browse/SPARK-732> for more info.
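
To illustrate the double-counting concern, here is a plain-Java sketch with no Spark involved. A shared counter stands in for an accumulator updated as a side effect, and one partition's task is deliberately run twice to imitate a re-executed attempt. RetrySketch and its methods are hypothetical names for this illustration only. Spark's documentation notes that accumulator updates made inside actions are applied only once per task, while updates made inside transformations may be applied more than once if tasks or stages are re-executed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class RetrySketch {
    // Side-effect counter standing in for an accumulator bumped once per element.
    static final AtomicLong sideEffectCount = new AtomicLong();

    // One "task": returns the partition's sum and also bumps the counter per element.
    static long runTask(List<Integer> partition) {
        long sum = 0;
        for (int x : partition) {
            sum += x;
            sideEffectCount.incrementAndGet();
        }
        return sum;
    }

    // Runs each partition once, except that the partition at retriedIndex gets an
    // extra attempt whose return value is discarded but whose side effects remain.
    static long runJob(List<List<Integer>> partitions, int retriedIndex) {
        long total = 0;
        for (int i = 0; i < partitions.size(); i++) {
            if (i == retriedIndex) {
                runTask(partitions.get(i)); // discarded attempt
            }
            total += runTask(partitions.get(i));
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),
                Arrays.asList(4, 5));
        long total = runJob(partitions, 0);
        System.out.println("returned sum = " + total);
        System.out.println("side-effect element count = " + sideEffectCount.get());
    }
}
```

The returned sum uses each partition exactly once, while the side-effect counter also reflects the discarded attempt, so it reports more elements than the data contains.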

