Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/12/31 09:47:49 UTC

[jira] [Resolved] (SPARK-4961) Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time

     [ https://issues.apache.org/jira/browse/SPARK-4961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-4961.
------------------------------
    Resolution: Won't Fix

> Put HadoopRDD.getPartitions forward to reduce DAGScheduler.JobSubmitted processing time
> ---------------------------------------------------------------------------------------
>
>                 Key: SPARK-4961
>                 URL: https://issues.apache.org/jira/browse/SPARK-4961
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: YanTang Zhai
>            Priority: Minor
>
> HadoopRDD.getPartitions is evaluated lazily, while DAGScheduler handles the JobSubmitted event. If the input directory is large, getPartitions can take a long time.
> For example, in our cluster it takes anywhere from 0.029s to 766.699s. While one JobSubmitted event is being processed, the others must wait. We therefore
> want to move HadoopRDD.getPartitions earlier to reduce the DAGScheduler.JobSubmitted processing time, so that other JobSubmitted events do not
> have to wait as long. A HadoopRDD object could compute its partitions when it is instantiated.
> We can analyse and compare the execution time before and after the optimization.
> TaskScheduler.start execution time: [time1__]
> DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or TaskScheduler.start) execution time: [time2_]
> HadoopRDD.getPartitions execution time: [time3___]
> Stages execution time: [time4_____]
> (1) The app has only one job
> (a)
> The execution time of the job before optimization is [time1__][time2_][time3___][time4_____].
> The execution time of the job after optimization is....[time1__][time3___][time2_][time4_____].
> In summary, if the app has only one job, the total execution time is the same before and after optimization.
> (2) The app has 4 jobs
> (a) Before optimization,
> job1 execution time is [time2_][time3___][time4_____],
> job2 execution time is [time2__________][time3___][time4_____],
> job3 execution time is................................[time2____][time3___][time4_____],
> job4 execution time is................................[time2_____________][time3___][time4_____].
> (b) After optimization,
> job1 execution time is [time3___][time2_][time4_____],
> job2 execution time is [time3___][time2__][time4_____],
> job3 execution time is................................[time3___][time2_][time4_____],
> job4 execution time is................................[time3___][time2__][time4_____].
> In summary, if the app has multiple jobs, the average execution time after optimization is lower than before.
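
The reporter's timing argument can be checked with a small queueing sketch. This is not Spark code: it is a toy model that assumes a single-threaded event loop handling JobSubmitted events one at a time, independent submitting threads, and stages that never compete for cluster resources. The function name, the job tuples, and all durations are made up for illustration.

```python
def simulate_latencies(jobs, eager):
    """Return per-job latency (submission to completion) in a toy model.

    jobs  -- list of (submit_time, partition_time, handle_time, stage_time),
             i.e. (when submitted, time3, time2, time4) in the ticket's terms.
    eager -- if True, partitions are computed in the submitting thread before
             the event reaches the loop; if False, inside the event loop.

    Assumption: the event loop is strictly serial, and stages of different
    jobs run concurrently without slowing each other down.
    """
    events = []
    for idx, (submit, t3, t2, t4) in enumerate(jobs):
        # Eager: time3 is paid outside the loop, so the event arrives later
        # but needs less time inside the loop.
        arrival = submit + t3 if eager else submit
        service = t2 if eager else t2 + t3
        events.append((arrival, idx, service, t4, submit))

    events.sort()  # the loop handles events in arrival order
    loop_free_at = 0.0
    latencies = [0.0] * len(jobs)
    for arrival, idx, service, t4, submit in events:
        start = max(arrival, loop_free_at)  # wait while the loop is busy
        loop_free_at = start + service
        latencies[idx] = loop_free_at + t4 - submit
    return latencies


# Hypothetical workload: two jobs submitted at t=0 and two at t=10,
# each with time3=3, time2=1, time4=5.
JOBS = [(0.0, 3.0, 1.0, 5.0), (0.0, 3.0, 1.0, 5.0),
        (10.0, 3.0, 1.0, 5.0), (10.0, 3.0, 1.0, 5.0)]

if __name__ == "__main__":
    for eager in (False, True):
        result = simulate_latencies(JOBS, eager)
        print("eager" if eager else "lazy", result, sum(result) / len(result))
```

With these invented durations the lazy variant gives latencies of 9, 13, 9, 13 (average 11) and the eager variant 9, 10, 9, 10 (average 9.5), while a single job takes 9 either way. That matches the shape of the argument above; it says nothing about whether the change is worthwhile in a real cluster.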


