Posted to issues@spark.apache.org by "Rajesh Balamohan (JIRA)" <ji...@apache.org> on 2016/01/28 09:13:39 UTC
[jira] [Created] (SPARK-13059) Sort inputsplits by size in HadoopRDD to avoid long tails
Rajesh Balamohan created SPARK-13059:
----------------------------------------
Summary: Sort inputsplits by size in HadoopRDD to avoid long tails
Key: SPARK-13059
URL: https://issues.apache.org/jira/browse/SPARK-13059
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: Rajesh Balamohan
HadoopRDD.getPartitions invokes getSplits on the InputFormat and returns the resulting splits as HadoopPartitions. The generated input splits are not always of equal size, and some can be much smaller than others. If the larger splits are scheduled near the end of the job, the job can end up with a long tail. Sorting the input splits by size (in descending order) would help schedule the larger splits up front. This could also help with speculation.
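
To illustrate the long-tail argument, here is a minimal sketch in Python. It is a toy model, not Spark's scheduler: it assumes task time is proportional to split size, ignores locality, and hands each split, in list order, to whichever slot frees up first. The helper names sort_splits_descending and makespan are hypothetical and do not exist in Spark.

```python
import heapq
from typing import List


def sort_splits_descending(split_sizes: List[int]) -> List[int]:
    """Return split sizes ordered largest first (hypothetical helper)."""
    return sorted(split_sizes, reverse=True)


def makespan(split_sizes: List[int], num_slots: int) -> int:
    """Simulate greedy scheduling of splits onto a fixed number of slots.

    Assumption: a task's run time equals its split size. Each split, taken
    in list order, is assigned to the slot that becomes free earliest.
    Returns the time at which the last slot finishes.
    """
    slots = [0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for size in split_sizes:
        free_at = heapq.heappop(slots)         # earliest available slot
        heapq.heappush(slots, free_at + size)  # slot is busy until then
    return max(slots)


if __name__ == "__main__":
    sizes = [1, 1, 1, 1, 1, 1, 8]  # one large split listed last
    print(makespan(sizes, 2))                          # large split runs last
    print(makespan(sort_splits_descending(sizes), 2))  # large split runs first
```

With two slots, the unsorted order finishes at time 11 because the large split starts only after the small ones are done, while the descending order finishes at time 8. In HadoopRDD the sizes would presumably come from each split's getLength, which is an assumption about how the proposal would be implemented.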
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)