Posted to issues@spark.apache.org by "Kent Yao (Jira)" <ji...@apache.org> on 2021/11/29 07:43:00 UTC

[jira] [Updated] (SPARK-37481) Disappearance of skipped stages mislead the bug hunting

     [ https://issues.apache.org/jira/browse/SPARK-37481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kent Yao updated SPARK-37481:
-----------------------------
    Description: 
## With FetchFailedException and Map Stage Retries

When rerunning the spark-sql shell with the original SQL in [https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315], the following was observed:

!https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png!

1. Stage 3 threw a FetchFailedException, which caused both itself and its parent stage (stage 2) to retry.
2. Stage 2 had been skipped before, but its attemptId was still 0, so when its retry happened it was removed from `Skipped Stages`.

As a result, the DAG of Job 2 no longer shows that stage 2 was skipped.
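
The sequence above can be sketched with a toy model. This is not Spark's actual UI bookkeeping: `JobView`, `mark_skipped` and `on_stage_submitted` are hypothetical names, the rule that any submission clears the skipped mark is an assumption made for illustration, and so is the attempt number used for the retry of stage 3.

```python
# Toy model only: hypothetical names, not Spark's real listener code.
from dataclasses import dataclass, field


@dataclass
class JobView:
    """Tracks which stages of one job a UI would list as skipped."""
    skipped: set = field(default_factory=set)
    submitted: list = field(default_factory=list)  # (stage_id, attempt_id) pairs

    def mark_skipped(self, stage_id: int) -> None:
        self.skipped.add(stage_id)

    def on_stage_submitted(self, stage_id: int, attempt_id: int) -> None:
        # Assumed behaviour: any submission of a stage id clears its skipped
        # mark. With attempt_id == 0, a retry of a previously skipped stage
        # cannot be told apart from a first run.
        self.skipped.discard(stage_id)
        self.submitted.append((stage_id, attempt_id))


job2 = JobView()
job2.mark_skipped(2)            # stage 2 is skipped in Job 2
job2.on_stage_submitted(3, 0)   # stage 3 runs and hits FetchFailedException
job2.on_stage_submitted(2, 0)   # stage 2 is resubmitted, still as attempt 0
job2.on_stage_submitted(3, 1)   # stage 3 retries (attempt number assumed)

print(2 in job2.skipped)        # whether stage 2 is still listed as skipped
print(job2.submitted)
```

In this sketch stage 2 drops out of the skipped set as soon as its retry is submitted, which mirrors the missing entry in the DAG.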

!https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png!

Besides, a retried stage usually runs only a subset of the tasks of the original stage. If we present it as the original attempt, its metrics can be misleading.
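
A small arithmetic sketch of why the metrics can mislead. All numbers here are invented for illustration and do not come from this issue: an original stage with 8 tasks, 2 lost map outputs, and 100 records written per task.

```python
# Illustrative numbers only; none of them are taken from the issue.
original_partitions = set(range(8))   # the original stage ran 8 tasks
lost_partitions = {1, 5}              # map outputs that must be recomputed
records_per_task = 100                # assumed uniform per-task metric

# A retry only needs to rerun the tasks whose output is missing.
retry_tasks = sorted(original_partitions & lost_partitions)

original_total = len(original_partitions) * records_per_task
retry_total = len(retry_tasks) * records_per_task

print(len(original_partitions), len(retry_tasks))
print(original_total, retry_total)
```

If the retry were displayed as the original attempt, a reader would see the smaller task count and totals and could wrongly conclude that the stage processed far less data than it did.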

  was:
## With FetchFailedException and Map Stage Retries

When rerunning spark-sql shell with the original SQL in https://gist.github.com/yaooqinn/6acb7b74b343a6a6dffe8401f6b7b45c#gistcomment-3977315

![image](https://user-images.githubusercontent.com/8326978/143821530-ff498caa-abce-483d-a24b-315aacf7e0a0.png)

1. stage 3 threw FetchFailedException and caused itself and its parent stage(stage 2) to retry
2. stage 2 was skipped before but its attemptId was still 0, so when its retry happened it got removed from `Skipped Stages` 

The DAG of Job 2 doesn't show that stage 2 is skipped anymore.

![image](https://user-images.githubusercontent.com/8326978/143824666-6390b64a-a45b-4bc8-b05d-c5abbb28cdef.png)


Besides, a retried stage usually has a subset of tasks from the original stage. If we mark it as an original one, the metrics might lead us into pitfalls.


> Disappearance of skipped stages mislead the bug hunting 
> --------------------------------------------------------
>
>                 Key: SPARK-37481
>                 URL: https://issues.apache.org/jira/browse/SPARK-37481
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.1.2, 3.2.0, 3.3.0
>            Reporter: Kent Yao
>            Priority: Major


