Posted to issues@spark.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2016/01/28 09:13:39 UTC

[jira] [Created] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails

Rajesh Balamohan created SPARK-13059:
----------------------------------------

             Summary: Sort inputsplits by size in HadoopRDD to avoid long tails
                 Key: SPARK-13059
                 URL: https://issues.apache.org/jira/browse/SPARK-13059
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
            Reporter: Rajesh Balamohan


HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns the resulting splits as HadoopPartitions. The input splits are not always of equal size, and some can be much smaller than others. If the bigger splits are scheduled at the end of the job, the job can end with a long tail. Sorting the input splits by size in descending order would schedule the larger splits first. This could also help with speculation.
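
To illustrate the long-tail effect, here is a minimal simulation sketch. It is not Spark code: the split sizes, the slot count, and the makespan helper are all hypothetical, and it assumes that task time is proportional to split size and that each split goes to whichever slot frees up first.

```python
import heapq

def makespan(split_sizes, num_slots):
    """Finish time when splits are handed out in list order to the next free slot.

    Assumption: processing time is proportional to split size.
    """
    slots = [0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for size in split_sizes:
        free_at = heapq.heappop(slots)          # earliest available slot
        heapq.heappush(slots, free_at + size)   # slot is busy until this split is done
    return max(slots)

# Hypothetical split sizes (MB): eight small splits, with one large split listed last.
splits = [16] * 8 + [128]

unsorted_finish = makespan(splits, num_slots=4)
sorted_finish = makespan(sorted(splits, reverse=True), num_slots=4)

print("original order:  ", unsorted_finish)
print("descending order:", sorted_finish)
```

With these made-up numbers, the large split starts last in the original order and the other three slots sit idle while it runs. In descending order it starts first and the small splits fill the remaining slots alongside it.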


