You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Jacek Laskowski <ja...@japila.pl> on 2020/04/26 09:53:09 UTC
ShuffleMapStage and pendingPartitions vs isAvailable or findMissingPartitions?
Hi,
I found that ShuffleMapStage has this (apparently superfluous)
pendingPartitions registry [1] for DAGScheduler and the description says:
" /**
* Partitions that either haven't yet been computed, or that were
computed on an executor
* that has since been lost, so should be re-computed. This variable is
used by the
* DAGScheduler to determine when a stage has completed. Task successes
in both the active
* attempt for the stage or in earlier attempts for this stage can cause
paritition ids to get
* removed from pendingPartitions. As a result, this variable may be
inconsistent with the pending
* tasks in the TaskSetManager for the active attempt for the stage (the
partitions stored here
* will always be a subset of the partitions that the TaskSetManager
thinks are pending).
*/
"
I'm curious why there is a need for this pendingPartitions
since isAvailable or findMissingPartitions (using MapOutputTrackerMaster)
know it already and I think are even more up-to-date. Why is there this
extra registry?
[1]
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala#L60
Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
"The Internals Of" Online Books <https://books.japila.pl/>
Follow me on https://twitter.com/jaceklaskowski
<https://twitter.com/jaceklaskowski>
Re: ShuffleMapStage and pendingPartitions vs isAvailable or
findMissingPartitions?
Posted by ZHANG Wei <we...@outlook.com>.
AFAICT, not must have `pendingPartitions`, `mapOutputTrackerMaster` is
added by a later change, `pendingPartitions` can be cleaned up.
--
Cheers,
-z
On Sun, 26 Apr 2020 11:53:09 +0200
Jacek Laskowski <ja...@japila.pl> wrote:
> Hi,
>
> I found that ShuffleMapStage has this (apparently superfluous)
> pendingPartitions registry [1] for DAGScheduler and the description says:
>
> " /**
> * Partitions that either haven't yet been computed, or that were
> computed on an executor
> * that has since been lost, so should be re-computed. This variable is
> used by the
> * DAGScheduler to determine when a stage has completed. Task successes
> in both the active
> * attempt for the stage or in earlier attempts for this stage can cause
> paritition ids to get
> * removed from pendingPartitions. As a result, this variable may be
> inconsistent with the pending
> * tasks in the TaskSetManager for the active attempt for the stage (the
> partitions stored here
> * will always be a subset of the partitions that the TaskSetManager
> thinks are pending).
> */
> "
>
> I'm curious why there is a need for this pendingPartitions
> since isAvailable or findMissingPartitions (using MapOutputTrackerMaster)
> know it already and I think are even more up-to-date. Why is there this
> extra registry?
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapStage.scala#L60
>
> Pozdrawiam,
> Jacek Laskowski
> ----
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books <https://books.japila.pl/>
> Follow me on https://twitter.com/jaceklaskowski
>
> <https://twitter.com/jaceklaskowski>
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org