Posted to user@spark.apache.org by "Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)" <ab...@bloomberg.net> on 2014/10/10 16:37:34 UTC

How does the Spark Accumulator work under the covers?

Hello,
I was wondering what the Spark accumulator does under the covers. I’ve implemented my own associative addInPlace function for the accumulator; where does this function run? Say you call something like myRdd.map(x => sum += x): is “sum” accumulated locally in any way, per element, partition, or node? Is “sum” a broadcast variable, or does it exist only on the driver node? How does the driver node get access to “sum”?
Thanks,
Areg

Re: How does the Spark Accumulator work under the covers?

Posted by Jayant Shekhar <ja...@cloudera.com>.
Hi Areg,

Check out
http://spark.apache.org/docs/latest/programming-guide.html#accumulators

val sum = sc.accumulator(0)  // accumulator created from an initial value in the driver

The accumulator variable is created in the driver. Tasks running on the
cluster can then add to it. However, they cannot read its value. Only the
driver program can read the accumulator’s value, using its value method.

sum.value  // in the driver
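
For instance, a minimal end-to-end sketch (assuming an RDD of Ints, and
using foreach rather than map so that an action actually forces the tasks
to run):

val sum = sc.accumulator(0)           // created in the driver
val myRdd = sc.parallelize(1 to 100)
myRdd.foreach(x => sum += x)          // += runs inside tasks on the workers
println(sum.value)                    // 5050, readable only in the driver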

> myRdd.map(x => sum += x)
> where is this function being run
This is run by the tasks on the workers; each task adds into its own local
copy of the accumulator.

The driver then collects these partial results from the various workers and
merges them (using your addInPlace) to get the final value, as Haripriya
mentioned.
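
To make that concrete for a custom addInPlace, here is a rough sketch
against the Spark 1.x API (the SumCountParam object and the (sum, count)
pair are made up for illustration; myRdd is assumed to be an RDD[Int]):

import org.apache.spark.AccumulatorParam

// Illustrative accumulator type: a (sum, count) pair.
object SumCountParam extends AccumulatorParam[(Long, Long)] {
  def zero(initial: (Long, Long)) = (0L, 0L)
  // For a plain Accumulator, addInPlace is invoked both on the workers
  // (folding each added term into the task-local copy) and on the driver
  // (merging the per-task results into the final value).
  def addInPlace(a: (Long, Long), b: (Long, Long)) =
    (a._1 + b._1, a._2 + b._2)
}

val acc = sc.accumulator((0L, 0L))(SumCountParam)
myRdd.foreach(x => acc.add((x.toLong, 1L)))   // on the workers
val (total, count) = acc.value                // on the driver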

Thanks,
Jayant


On Fri, Oct 10, 2014 at 7:46 AM, HARIPRIYA AYYALASOMAYAJULA <
aharipriya92@gmail.com> wrote:

> If you use parallelize, the data is distributed across the available
> nodes; the sum is computed within each partition, and the partial results
> are later merged. The driver manages the entire process. Is my
> understanding correct? Can someone please correct me if I am wrong?
>
> On Fri, Oct 10, 2014 at 9:37 AM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -)
> <ab...@bloomberg.net> wrote:
>
>> Hello,
>> I was wondering what the Spark accumulator does under the covers. I’ve
>> implemented my own associative addInPlace function for the accumulator;
>> where does this function run? Say you call something like
>> myRdd.map(x => sum += x): is “sum” accumulated locally in any way, per
>> element, partition, or node? Is “sum” a broadcast variable, or does it
>> exist only on the driver node? How does the driver node get access to
>> “sum”?
>> Thanks,
>> Areg
>>
>
>
>
> --
> Regards,
> Haripriya Ayyalasomayajula
>

Re: How does the Spark Accumulator work under the covers?

Posted by HARIPRIYA AYYALASOMAYAJULA <ah...@gmail.com>.
If you use parallelize, the data is distributed across the available
nodes; the sum is computed within each partition, and the partial results
are later merged. The driver manages the entire process. Is my
understanding correct? Can someone please correct me if I am wrong?
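
By hand, that per-partition flow would look roughly like the sketch below
(just the mental model, not what the accumulator machinery literally does):

val myRdd = sc.parallelize(1 to 100, numSlices = 4)
// Each partition computes its own local sum...
val partialSums = myRdd.mapPartitions(it => Iterator(it.sum)).collect()
// ...and the driver merges the partial results into the final value.
val total = partialSums.sum   // 5050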

On Fri, Oct 10, 2014 at 9:37 AM, Areg Baghdasaryan (BLOOMBERG/ 731 LEX -) <
abaghdasary2@bloomberg.net> wrote:

> Hello,
> I was wondering what the Spark accumulator does under the covers. I’ve
> implemented my own associative addInPlace function for the accumulator;
> where does this function run? Say you call something like
> myRdd.map(x => sum += x): is “sum” accumulated locally in any way, per
> element, partition, or node? Is “sum” a broadcast variable, or does it
> exist only on the driver node? How does the driver node get access to
> “sum”?
> Thanks,
> Areg
>



-- 
Regards,
Haripriya Ayyalasomayajula