Posted to issues@spark.apache.org by "Sameer Agarwal (JIRA)" <ji...@apache.org> on 2018/01/08 20:54:01 UTC

[jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

     [ https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal updated SPARK-15690:
-----------------------------------
    Target Version/s: 2.4.0  (was: 2.3.0)

> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>
>                 Key: SPARK-15690
>                 URL: https://issues.apache.org/jira/browse/SPARK-15690
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, SQL
>            Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by partition id and then writes the data to disk. This is not a big bottleneck because the network throughput on commodity clusters tends to be low. However, an increasing number of Spark users are using the system to process data on a single node. When operating on a single node against intermediate data that fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix sort to do data shuffling on a single node, while still gracefully falling back to disk if the data does not fit in memory. Given that the number of partitions is usually small (say, less than 256), it'd require only a single pass to do the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This ticket has a smaller scope (single-process) and aims to actually productionize this code.
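To illustrate the single-pass idea in the description: with fewer than 256 partitions, the partition id fits in one byte, so records can be grouped by partition with one counting-sort (single-radix-pass) sweep. The sketch below is a minimal standalone illustration, not Spark's actual shuffle code; the packed-pointer layout (`partitionId << 24 | offset`) and the class/method names are hypothetical.

```java
import java.util.Arrays;

// Minimal sketch (not Spark's implementation): group record pointers by
// partition id in a single counting-sort pass, i.e. one radix pass over
// the high byte. Assumes the hypothetical encoding (partitionId << 24) | offset.
public class PartitionSort {
    static int[] sortByPartition(int[] recordPtrs, int numPartitions) {
        // Pass 1: histogram of partition ids (shifted by one for prefix sums).
        int[] counts = new int[numPartitions + 1];
        for (int ptr : recordPtrs) counts[(ptr >>> 24) + 1]++;
        // Exclusive prefix sums: counts[p] = first output slot for partition p.
        for (int p = 0; p < numPartitions; p++) counts[p + 1] += counts[p];
        // Pass 2: scatter each pointer to its partition's region (stable).
        int[] out = new int[recordPtrs.length];
        for (int ptr : recordPtrs) out[counts[ptr >>> 24]++] = ptr;
        return out;
    }

    public static void main(String[] args) {
        int[] ptrs = {(2 << 24) | 7, (0 << 24) | 3, (1 << 24) | 5, (0 << 24) | 9};
        int[] sorted = sortByPartition(ptrs, 3);
        // Records are now contiguous per partition: ids 0, 0, 1, 2.
        System.out.println(Arrays.toString(
            Arrays.stream(sorted).map(p -> p >>> 24).toArray())); // prints [0, 0, 1, 2]
    }
}
```

Because only the high byte is inspected, a single histogram-plus-scatter pass suffices, which is what makes the in-memory path CPU-efficient compared to a comparison sort.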



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org