Posted to user@spark.apache.org by Faiz Halde <ha...@gmail.com> on 2022/10/14 12:54:19 UTC

[SparkListener] Calculating the total amount of re-computations / waste

Hello,

We run our Spark workloads on spot instances and would like to quantify
the impact of spot interruptions on our workloads. We are proposing the
following metric, but would like your opinions on it.

We are leveraging Spark's event listener (SparkListener) and computing the following:

T = task

T1 = sum(T.execution-time) for all T where T.status=failed and
T.stage-attempt-number = 0

T2 = sum(T.execution-time) for all T where T.stage-attempt-number > 0

Tall = sum(T.execution-time)

Retry% = (T1 + T2) / Tall

The assumptions are:

T1 – if a stage is executing for the first time, then only the tasks that
failed were waste
T2 – every task executed for a stage with stage-attempt-number > 0 is a
retry, since the stage already had an earlier attempt
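
Concretely, a sketch of the listener we have in mind (Scala, against the
public SparkListener API; the class name and the LongAdder bookkeeping
below are just illustrative):

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Accumulates the T1 / T2 / Tall buckets defined above.
class RetryWasteListener extends SparkListener {
  private val t1   = new LongAdder  // failed tasks on a stage's first attempt
  private val t2   = new LongAdder  // every task on a retried stage attempt
  private val tAll = new LongAdder  // all tasks, regardless of outcome

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    val info = event.taskInfo
    val ms = info.duration                // wall-clock task time in ms
    tAll.add(ms)
    if (event.stageAttemptId == 0) {
      if (info.failed) t1.add(ms)         // T1
    } else {
      t2.add(ms)                          // T2
    }
  }

  // Retry% = (T1 + T2) / Tall
  def retryPct: Double = {
    val all = tAll.sum.toDouble
    if (all == 0.0) 0.0 else (t1.sum + t2.sum) / all
  }
}

It would be registered with
spark.sparkContext.addSparkListener(new RetryWasteListener), or shipped in
a jar and named via the spark.extraListeners config.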

Re: [SparkListener] Calculating the total amount of re-computations / waste

Posted by Emil Ejbyfeldt <ee...@liveintent.com.INVALID>.
Hi,

I don't think the assumption for T2 is correct. For example, if there is a 
fetch failure (one that requires recomputation in an earlier stage) for the 
first task in a stage, that will cause the stage to be retried, and most of 
the actual work will then happen with `stage-attempt-number` > 0.

I believe the correct way is to do this calculation at the task level: 
look at which tasks have run multiple times and calculate based on that. 
But even then, things like persist mean that the same task might do 
different amounts of work across runs, so it might not be clear which CPU 
time should be counted for the "not wasted" run of a task.
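
For illustration, a sketch of that task-level approach (Scala; keyed on
(stageId, partition index), counting the first successful run of each
partition as useful and everything else as waste; the class name is made
up, and the persist caveat above still applies, since the "useful" run may
have read from cache and so done less work):

import java.util.concurrent.atomic.LongAdder
import scala.collection.concurrent.TrieMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// First successful run of each (stage, partition) counts as useful work;
// failed runs and later duplicates (stage retries, speculation) are waste.
class TaskLevelWasteListener extends SparkListener {
  private val done  = TrieMap.empty[(Int, Int), Boolean]
  private val waste = new LongAdder
  private val total = new LongAdder

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    val info = event.taskInfo
    val ms = info.duration
    total.add(ms)
    // taskInfo.index is the partition this task computed
    val key = (event.stageId, info.index)
    val firstSuccess = info.successful && done.putIfAbsent(key, true).isEmpty
    if (!firstSuccess) waste.add(ms)
  }

  def wastePct: Double =
    if (total.sum == 0L) 0.0 else waste.sum.toDouble / total.sum
}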

/ Emil

