Posted to issues@spark.apache.org by "Evan Chan (JIRA)" <ji...@apache.org> on 2016/11/15 00:40:58 UTC

[jira] [Reopened] (SPARK-8133) sticky partitions

     [ https://issues.apache.org/jira/browse/SPARK-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evan Chan reopened SPARK-8133:
------------------------------

I think this is worth looking into again - for streaming. My team is building Spark Streaming pipelines that do aggregations. For correctness and efficiency, if we can maintain a cache of current aggregation values across micro batches, we can lower the load on datastores and improve performance without making the batch size too big (which leads to other problems). Partitioning in Kafka does not solve this, because we need to do groupBys, sorts, etc. on the incoming stream; in particular we want the sorted output to be "sticky" to a particular node.
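To make the aggregation case concrete, here is a minimal sketch (the socket source, field layout, and checkpoint path are illustrative, not our actual pipeline) of keeping a running per-key total across micro batches with updateStateByKey. Spark carries this state from batch to batch, but nothing pins the state RDD's partitions to the same nodes, which is exactly the gap described above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("sticky-agg-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("/tmp/sticky-agg-checkpoint")  // required for stateful DStream ops

    // Hypothetical input: lines of "customerId,amount"
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map(_.split(","))
      .map(fields => (fields(0), fields(1).toLong))

    // Running total per key, carried across micro batches
    val runningTotals = pairs.updateStateByKey[Long] {
      (newAmounts: Seq[Long], total: Option[Long]) =>
        Some(total.getOrElse(0L) + newAmounts.sum)
    }

    runningTotals.print()
    ssc.start()
    ssc.awaitTermination()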

Maintaining a cache or in-memory state requires "stickiness" of partitions to nodes. We are exploring two avenues to achieve this and can contribute them back:

1) By modifying TaskSchedulerImpl, we can avoid reshuffling task-to-executor assignments when allocating executors/workers. This solves stickiness for clusters where the number of executors does not change.

2) Using a custom ShuffledRDD (or a derived class) that places each shuffled data partition on the same node for a given range of keys (assuming a HashPartitioner with a constant number of partitions); a rough sketch follows below.
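For avenue 2, roughly what we have in mind (the class name, the static host list, and the modulo mapping are illustrative; getPreferredLocations is only a locality hint to the scheduler, not a hard placement guarantee):

    import scala.reflect.ClassTag

    import org.apache.spark.{HashPartitioner, Partition}
    import org.apache.spark.rdd.{RDD, ShuffledRDD}

    // Sketch: derive from ShuffledRDD and pin each shuffle output partition
    // to a fixed host, so the same hash partition index (i.e. the same key
    // range) prefers the same node in every micro batch.
    class StickyShuffledRDD[K: ClassTag, V: ClassTag, C: ClassTag](
        prev: RDD[_ <: Product2[K, V]],
        part: HashPartitioner,
        preferredHosts: IndexedSeq[String])  // e.g. one host per partition
      extends ShuffledRDD[K, V, C](prev, part) {

      override protected def getPreferredLocations(split: Partition): Seq[String] =
        Seq(preferredHosts(split.index % preferredHosts.size))
    }

    // Usage sketch (keyedRdd and hosts assumed to exist):
    //   new StickyShuffledRDD[String, Long, Long](keyedRdd, new HashPartitioner(4), hosts)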

> sticky partitions
> -----------------
>
>                 Key: SPARK-8133
>                 URL: https://issues.apache.org/jira/browse/SPARK-8133
>             Project: Spark
>          Issue Type: New Feature
>          Components: DStreams
>    Affects Versions: 1.3.1
>            Reporter: sid
>
> We are trying to replace Apache Storm with Apache Spark Streaming.
> In Storm, we partitioned the stream based on "Customer ID" so that messages within a given range of customer IDs are routed to the same bolt (worker).
> We do this because each worker caches customer details (from a DB).
> So we split into 4 partitions and each bolt (worker) handles 1/4 of the entire range.
> I am hoping we have a solution to this in Spark Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
