Posted to user@spark.apache.org by Kira <me...@gmail.com> on 2016/01/13 14:43:34 UTC

Read Accumulator value while running

Hi, 

So I have an action on one RDD that is relatively long; let's call it ac1.
What I want to do is to execute another action (ac2) on the same RDD to
watch the evolution of the first one (ac1). To this end I want to use an
accumulator and read its value progressively (on the fly) while ac1 is
still running. My problem is that the accumulator is only updated once ac1
has finished, which is not helpful for me :/
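
To make this concrete, here is roughly what I am doing (only a sketch:
"data" is my input RDD and func1 stands for the real per-record work):

val acc = sc.accumulator(0L, "ac1 progress")

// ac1: the long-running action, updating the accumulator as it goes
data.foreach { x =>
  func1(x)      // placeholder for the actual work on each record
  acc += 1L
}

// What I would like is to read acc.value from the driver (or from ac2)
// while ac1 is still running, but the value only changes once ac1 is done.
println(s"value after ac1: ${acc.value}")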

I've seen here
<http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html>
what may seem like a solution for me, but it doesn't work: "While Spark
already offers support for asynchronous reduce (collect data from workers,
while not interrupting execution of a parallel transformation) through
accumulator"

Another post suggested using a SparkListener for this.
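
If the SparkListener route is the right one, is something like the sketch
below the intended usage? (My own rough attempt; ProgressListener is a name
I made up, and I am not sure how fresh the per-task accumulable values seen
by a listener actually are.)

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Print accumulator updates as each task finishes
class ProgressListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    taskEnd.taskInfo.accumulables.foreach { acc =>
      println(s"accumulable ${acc.name}: update=${acc.update} value=${acc.value}")
    }
  }
}

sc.addSparkListener(new ProgressListener)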

Are these solutions correct? If yes, could you give me a simple example?
Are there other solutions?

Thank you.
Regards



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Read-Accumulator-value-while-running-tp25960.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Read Accumulator value while running

Posted by Mennour Rostom <me...@gmail.com>.
Hi Daniel, Andrew,

Thank you for your answers. So it is not possible to read the accumulator
value until the action that manipulates it finishes; that's unfortunate,
and I will think of something else. However, the most important thing in my
application is the ability to launch 2 (or more) actions in parallel and
concurrently: "*within* each Spark application, multiple “jobs” (Spark
actions) may be running concurrently if they were submitted by different
threads" [official Job Scheduling documentation].

My actions have to run on the same RDD, e.g.:

RDD D;
D.ac1(func1)  [foreach, for instance]  //  D.ac2(func2)  [foreachPartition
or whatever]
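
Concretely, I imagine something like the following sketch (func1 and func2
stand for my real functions): two driver threads each submit an action on
the same cached RDD, so that Spark can schedule the two jobs concurrently.

val d = sc.parallelize(1 to 1000000).cache()   // the shared RDD "D"

// ac1: long-running action submitted from one driver thread
val t1 = new Thread(new Runnable {
  def run(): Unit = d.foreach(x => func1(x))
})

// ac2: a second action on the same RDD, submitted from another thread
val t2 = new Thread(new Runnable {
  def run(): Unit = d.foreachPartition(it => func2(it))
})

t1.start(); t2.start()
t1.join(); t2.join()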


Can this be done with asynchronous actions? (It is not working for me on a
single node :/ )
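
For reference, this is roughly how I understand the asynchronous actions
API (FutureAction); I may well be using it wrong. Here d is the same cached
RDD as in the sketch above, and func1 is again a placeholder.

import scala.concurrent.ExecutionContext.Implicits.global

// Async variants return a FutureAction instead of blocking the driver
val f1 = d.foreachAsync(x => func1(x))   // ac1
val f2 = d.countAsync()                  // ac2, e.g. a simple count

f2.onComplete(result => println(s"ac2 finished: $result"))
f1.onComplete(_ => println("ac1 finished"))
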
Also, can I use the same broadcast variable in both? If yes, what happens
if I change the value of the broadcast variable?

On the other hand, I would really like to know more about Spark 2.0.

Thank you,
Regards

2016-01-13 20:31 GMT+01:00 Andrew Or <an...@databricks.com>:

> Hi Kira,
>
> As you suspected, accumulator values are only updated after the task
> completes. We do send accumulator updates from the executors to the driver
> on periodic heartbeats, but these only concern internal accumulators, not
> the ones created by the user.
>
> In short, I'm afraid there is not currently a way (in Spark 1.6 and
> before) to access the accumulator values until after the tasks that updated
> them have completed. This will change in Spark 2.0, the next version,
> however.
>
> Please let me know if you have more questions.
> -Andrew
>
> 2016-01-13 11:24 GMT-08:00 Daniel Imberman <da...@gmail.com>:
>
>> Hi Kira,
>>
>> I'm having some trouble understanding your question. Could you please
>> give a code example?
>>
>>
>>
>> From what I think you're asking there are two issues with what you're
>> looking to do. (Please keep in mind I could be totally wrong on both of
>> these assumptions, but this is what I've been led to believe)
>>
>> 1. The contract of an accumulator is that you can't actually read the
>> value as the function is performing because the values in the accumulator
>> don't actually mean anything until they are reduced. If you were looking
>> for progress in a local context, you could do mapPartitions and have a
>> local accumulator per partition, but I don't think it's possible to get the
>> actual accumulator value in the middle of the map job.
>>
>> 2. As far as performing ac2 while ac1 is "always running", I'm pretty
>> sure that's not possible. The way that lazy evaluation works in Spark, the
>> transformations have to be done serially. Having it any other way would
>> actually be really bad because then you could have ac1 changing the data
>> thereby making ac2's output unpredictable.
>>
>> That being said, with a more specific example it might be possible to
>> help figure out a solution that accomplishes what you are trying to do.
>>
>> On Wed, Jan 13, 2016 at 5:43 AM Kira <me...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> So I have an action on one RDD that is relatively long; let's call it
>>> ac1. What I want to do is to execute another action (ac2) on the same
>>> RDD to watch the evolution of the first one (ac1). To this end I want
>>> to use an accumulator and read its value progressively (on the fly)
>>> while ac1 is still running. My problem is that the accumulator is only
>>> updated once ac1 has finished, which is not helpful for me :/
>>>
>>> I've seen here
>>> <
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html
>>> >
>>> what may seem like a solution for me, but it doesn't work: "While Spark
>>> already offers support for asynchronous reduce (collect data from
>>> workers, while not interrupting execution of a parallel transformation)
>>> through accumulator"
>>>
>>> Another post suggested using a SparkListener for this.
>>>
>>> Are these solutions correct? If yes, could you give me a simple example?
>>> Are there other solutions?
>>>
>>> Thank you.
>>> Regards
>>>
>>>
>>>
>>>
>

Re: Read Accumulator value while running

Posted by Andrew Or <an...@databricks.com>.
Hi Kira,

As you suspected, accumulator values are only updated after the task
completes. We do send accumulator updates from the executors to the driver
on periodic heartbeats, but these only concern internal accumulators, not
the ones created by the user.

In short, I'm afraid there is not currently a way (in Spark 1.6 and before)
to access the accumulator values until after the tasks that updated them
have completed. This will change in Spark 2.0, the next version, however.
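
To illustrate what that means in practice (just a sketch, with made-up
names like "data", "doWork" and "processed"): if you poll the accumulator
from the driver while the job runs in another thread, the value you read in
Spark 1.6 and earlier will at best reflect tasks that have already finished.

val processed = sc.accumulator(0L, "processed")

// ac1 submitted from a separate driver thread
val job = new Thread(new Runnable {
  def run(): Unit = data.foreach { x => doWork(x); processed += 1L }
})
job.start()

// Poll from the main driver thread; this only reflects completed tasks
while (job.isAlive) {
  println(s"processed so far: ${processed.value}")
  Thread.sleep(5000)
}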

Please let me know if you have more questions.
-Andrew

2016-01-13 11:24 GMT-08:00 Daniel Imberman <da...@gmail.com>:

> Hi Kira,
>
> I'm having some trouble understanding your question. Could you please give
> a code example?
>
>
>
> From what I think you're asking there are two issues with what you're
> looking to do. (Please keep in mind I could be totally wrong on both of
> these assumptions, but this is what I've been led to believe)
>
> 1. The contract of an accumulator is that you can't actually read the
> value as the function is performing because the values in the accumulator
> don't actually mean anything until they are reduced. If you were looking
> for progress in a local context, you could do mapPartitions and have a
> local accumulator per partition, but I don't think it's possible to get the
> actual accumulator value in the middle of the map job.
>
> 2. As far as performing ac2 while ac1 is "always running", I'm pretty sure
> that's not possible. The way that lazy evaluation works in Spark, the
> transformations have to be done serially. Having it any other way would
> actually be really bad because then you could have ac1 changing the data
> thereby making ac2's output unpredictable.
>
> That being said, with a more specific example it might be possible to help
> figure out a solution that accomplishes what you are trying to do.
>
> On Wed, Jan 13, 2016 at 5:43 AM Kira <me...@gmail.com> wrote:
>
>> Hi,
>>
>> So I have an action on one RDD that is relatively long; let's call it
>> ac1. What I want to do is to execute another action (ac2) on the same
>> RDD to watch the evolution of the first one (ac1). To this end I want to
>> use an accumulator and read its value progressively (on the fly) while
>> ac1 is still running. My problem is that the accumulator is only updated
>> once ac1 has finished, which is not helpful for me :/
>>
>> I've seen here
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html
>> >
>> what may seem like a solution for me, but it doesn't work: "While Spark
>> already offers support for asynchronous reduce (collect data from
>> workers, while not interrupting execution of a parallel transformation)
>> through accumulator"
>>
>> Another post suggested using a SparkListener for this.
>>
>> Are these solutions correct? If yes, could you give me a simple example?
>> Are there other solutions?
>>
>> Thank you.
>> Regards
>>
>>
>>
>>
>>

Re: Read Accumulator value while running

Posted by Daniel Imberman <da...@gmail.com>.
Hi Kira,

I'm having some trouble understanding your question. Could you please give
a code example?



From what I think you're asking there are two issues with what you're
looking to do. (Please keep in mind I could be totally wrong on both of
these assumptions, but this is what I've been led to believe)

1. The contract of an accumulator is that you can't actually read the value
as the function is performing because the values in the accumulator don't
actually mean anything until they are reduced. If you were looking for
progress in a local context, you could do mapPartitions and have a local
accumulator per partition, but I don't think it's possible to get the
actual accumulator value in the middle of the map job.
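
As a rough sketch of that first idea (the names here are made up, and the
per-partition counter is only visible on the executor side, e.g. in its
logs, not on the driver):

rdd.mapPartitionsWithIndex { (partitionId, iter) =>
  var localCount = 0L            // local "accumulator" for this partition
  iter.map { record =>
    localCount += 1
    if (localCount % 10000 == 0) {
      // shows up in the executor logs, not on the driver
      println(s"partition $partitionId has processed $localCount records")
    }
    process(record)              // placeholder for the real work
  }
}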

2. As far as performing ac2 while ac1 is "always running", I'm pretty sure
that's not possible. The way that lazy evaluation works in Spark, the
transformations have to be done serially. Having it any other way would
actually be really bad because then you could have ac1 changing the data
thereby making ac2's output unpredictable.

That being said, with a more specific example it might be possible to help
figure out a solution that accomplishes what you are trying to do.

On Wed, Jan 13, 2016 at 5:43 AM Kira <me...@gmail.com> wrote:

> Hi,
>
> So I have an action on one RDD that is relatively long; let's call it
> ac1. What I want to do is to execute another action (ac2) on the same
> RDD to watch the evolution of the first one (ac1). To this end I want to
> use an accumulator and read its value progressively (on the fly) while
> ac1 is still running. My problem is that the accumulator is only updated
> once ac1 has finished, which is not helpful for me :/
>
> I've seen here
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html
> >
> what may seem like a solution for me, but it doesn't work: "While Spark
> already offers support for asynchronous reduce (collect data from
> workers, while not interrupting execution of a parallel transformation)
> through accumulator"
>
> Another post suggested using a SparkListener for this.
>
> Are these solutions correct? If yes, could you give me a simple example?
> Are there other solutions?
>
> Thank you.
> Regards
>
>
>
>
>