Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/12 23:54:56 UTC

[GitHub] [spark] JoshRosen commented on pull request #34265: [SPARK-23626][CORE] Eagerly compute RDD.partitions on entire DAG when submitting job to DAGScheduler

JoshRosen commented on pull request #34265:
URL: https://github.com/apache/spark/pull/34265#issuecomment-941751848


   This is a longstanding issue, and there have been multiple previous attempts to fix it:
   
   - #3794
   - #20770
   - #24438
   - #27234
   
   Some early attempts were rejected due to thread-safety issues in their approaches, while others became stale without review.
   
   This PR's approach is very similar to @ajithme's approach in #27234, with a few key differences:
   
   - I allowed exceptions to bubble instead of logging and ignoring them.
   - I used a faster and less-race-condition-prone testing approach (using the `SchedulerIntegrationSuite` framework).
   - I used a non-recursive tree-traversal method (based on similar existing methods) to avoid stack overflow errors when traversing huge DAGs.
   - I also added the fix to `submitMapStage` and `runApproximateJob`: these are much less frequently used code paths, but they can still benefit from the fix.
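
   The stack-safety point above can be illustrated with a small sketch. This is not Spark's actual code (which is in Scala, in `DAGScheduler`); the class and function names below are hypothetical, and the `partitions` computation is a trivial stand-in for the potentially expensive, potentially failing `RDD.partitions` call. The key technique is an iterative depth-first traversal with an explicit stack, so arbitrarily deep lineage chains cannot overflow the call stack, and any exception bubbles up to the caller instead of being logged and swallowed:

   ```python
   # Hedged sketch: non-recursive traversal of a dependency DAG that eagerly
   # forces a per-node "partitions" computation on every reachable node.
   # Names are illustrative only, not taken from Spark's implementation.

   class Node:
       def __init__(self, name, deps=()):
           self.name = name
           self.deps = list(deps)
           self._partitions = None  # computed lazily, like RDD.partitions

       def partitions(self):
           # Stand-in for the real (possibly failing) partitions computation;
           # exceptions raised here propagate to the caller uncaught.
           if self._partitions is None:
               self._partitions = [0]
           return self._partitions

   def eagerly_compute_partitions(root):
       """Visit each node reachable from `root` exactly once, without
       recursion, forcing its partitions computation along the way."""
       visited = set()
       stack = [root]  # explicit stack replaces the call stack
       while stack:
           node = stack.pop()
           if id(node) in visited:
               continue
           visited.add(id(node))
           node.partitions()        # eager evaluation; errors bubble up
           stack.extend(node.deps)  # push dependencies for later visits
       return visited
   ```

   With this shape, a lineage chain far deeper than the interpreter's recursion limit traverses without error, which is the property a recursive implementation would lose.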


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
For additional commands, e-mail: reviews-help@spark.apache.org